Debate Review: ISA -a Critical Interface EECS 252 Graduate Computer • Extremely well defined abstraction Architecture software • Huge, quantitative base of usage data for real applications Prog. Lang. Compiler Operating Systems filtered through SOA compiler technology instruction set Lec 4 – Issues in Basic Pipelines (stalls, • Huge quantitative base of implementation costs and performance exceptions, branch prediction) hardware • Convergence trend – enable optimizations, support HLL, OS David Culler support, contain complexity • Properties of a good abstraction Electrical Engineering and Computer Sciences • Lots of marketing (ignores, – Lasts through many generations (portability) misuses, or selective use of University of California, Berkeley – Used in many different ways (generality) established data) – Provides convenient functionality to higher levels • Worse when we get beyond http://www.eecs.berkeley.edu/~culler – Permits an efficient implementation at lower scalar operations levels http://www-inst.eecs.berkeley.edu/~cs252 • Translation / Interpretation boundary becoming less sharp 1/27/2005 CS252S05 L4 Pipe Issues 2 Discussion Exercise Ordering Properties of basic inst. pipeline Time (clock cycles) • In terms of the ‘iron triangle” what are the Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 performance implications of condition-codes? I ALU Ifetch Reg DMem Reg Execution window n s t ALU Reg Ifetch Reg DMem r. ALU Ifetch Reg DMem Reg O r ALU Reg Ifetch Reg DMem d e r Issue Complete • Instructions issued in order • Operand fetch is stage 2 => operand fetched in order • Write back in stage 5 => no WAW, no WAR hazards • Common pipeline flow => operands complete in order • Stage changes only at “end of instruction” 1/27/2005 CS252S05 L4 Pipe Issues 3 1/27/2005 CS252S05 L4 Pipe Issues 4 Control Pipeline What does forwarding do? Time (clock cycles) nPC I ALU n add r1,r2,r3 Ifetch Reg DMem Reg mux Op A s Registers A-res t ALU sub r4,r1,r3 Ifetch Reg DMem Reg ALU MEM-res r. D-Mem Op B ALU Ifetch Reg DMem Reg mux O and r6,r1,r7 B-byp r d ALU Ifetch Reg DMem Reg brch or r8,r1,r9 e r Ifetch Reg ALU DMem Reg xor r10,r1,r4 IR mop wop fwd ctrl op aop Next PC wop I-fetch mop Rr PC wop dcd Rr • Destination register is a name for instr’s result + Ra kill kill Rb ? • Source registers are names for sources Rr ? imed • Forwarding logic builds data dependence | & - +4 (flow) graph for instructions in the execution PC3 PC1 nPC PC2 window x CS252S05 L4 Pipe Issues 5 CS252S05 L4 Pipe Issues 6 1/27/2005 1/27/2005 NOW Handout Page 1
Multicycle stages Historical Perspective: Microprogramming “macro-instructions” Datapath Stage User program Main plus Data ADD Memory SUB AND Nxt Pipeline Contr Reg . Pipeline Control Reg . . one of these is DATA mapped into a execution sequence of these unit CPU control memory Micro-sequencer control Stall Datapath control Writable-control store? Supported complex instructions a sequence of simple micro-inst (RTs) • Stage microsequencer spits micro-ops into the pipe Pipelined micro-instruction processing, but very limited view. Could not reorganize macroinstructions to enable pipelining 1/27/2005 CS252S05 L4 Pipe Issues 7 1/27/2005 CS252S05 L4 Pipe Issues 8 Branch prediction Typical “simple” Pipeline • Example: MIPS R4000 • Datapath parallelism only useful if you can keep it fed. integer unit • Easy to fetch multiple (consecutive) instructions ex per cycle FP/int Multiply – essentially speculating on sequential flow IF ID MEM WB • Jump: unconditional change of control flow m1 m2 m3 m4 m5 m6 m7 – Always taken FP adder • Branch: conditional change of control flow a1 a2 a3 a4 – Taken about 50% of the time FP/int divider – Backward: 30% x 80% taken Div (lat = 25, – Forward: 70% x 40% taken Init inv=25) 1/27/2005 CS252S05 L4 Pipe Issues 9 1/27/2005 CS252S05 L4 Pipe Issues 10 Case for Branch Prediction when A Big Idea for Today Issue N instructions per clock cycle • Reactive: past actions cause system to adapt use 1. Branches will arrive up to n times faster in an n - issue processor – do what you did before better – ex: caches 2. Amdahl’s Law => relative impact of the control – TCP windows stalls will be larger with the lower potential CPI – URL completion, ... in an n -issue processor • Proactive: uses past actions to predict future actions – optimize speculatively, anticipate what you are about to do – branch prediction conversely, need branch prediction to ‘see’ – long cache blocks potential parallelism – ??? CS252S05 L4 Pipe Issues 11 CS252S05 L4 Pipe Issues 12 1/27/2005 1/27/2005 NOW Handout Page 2
Branch Prediction Schemes Dynamic Branch Prediction • Performance = ƒ(accuracy, cost of misprediction) 0. Static Branch Prediction • Branch History Table: Lower bits of PC address index table of • 1-bit Branch-Prediction Buffer 1-bit values – Says whether or not branch taken last time • 2-bit Branch-Prediction Buffer – No address check » saves HW, but may not be right branch • Correlating Branch Prediction Buffer – If inst == BR, update table with outcome • Problem: in a loop, 1-bit BHT will cause 2 mispredictions • Tournament Branch Predictor – End of loop case, when it exits instead of looping as before • Branch Target Buffer – First time through loop on next time through code, when it predicts exit instead of looping • Integrated Instruction Fetch Units – avg is 9 iterations before exit – Only 80% accuracy even if loop 90% of the time • Return Address Predictors • Local history – This particular branch inst » Or one that maps into same lost PC 1/27/2005 CS252S05 L4 Pipe Issues 13 1/27/2005 CS252S05 L4 Pipe Issues 14 2-bit Dynamic Branch Prediction Consider 3 Scenarios (J. Smith, 1981) • 2-bit scheme where change prediction only if get • Branch for loop test misprediction twice: • Check for error or exception predictors T • Alternating taken / not-taken Global history NT – example? Predict Taken Predict Taken taken T T NT NT Predict Not • Your worst-case prediction scenario Predict Not T Taken Taken • Red: stop, not taken NT • Green: go, taken • How could HW predict “this loop will execute 3 • Adds hysteresis to decision making process times” using a simple mechanism? • Generalize to n-bit saturating counter 1/27/2005 CS252S05 L4 Pipe Issues 15 1/27/2005 CS252S05 L4 Pipe Issues 16 Correlating Branches Accuracy of Different Schemes (Figure 3.15, p. 206) 20% Idea: taken/not taken Branch address (4 bits) 18% of recently executed 4096 Entries 2-bit BHT 18% Frequency of Mispredictions branches is related to 16% Unlimited Entries 2-bit BHT 2-bits per branch behavior of next 1024 Entries (2,2) BHT 14% local predictors Frequency of Mispredictions branch (as well as the 12% history of that branch 11% behavior) 10% – Then behavior of recent Prediction Prediction 8% branches selects 6% 6% 6% between, say, 4 6% 5% 5% predictions of next 4% 4% branch, updating just that prediction 0% 2% 1% 1% • (2,2) predictor: 2-bit 0% 0% global, 2-bit local 2-bit recent global nasa7 matrix300 tomcatv doducd spice fpppp gcc espresso eqntott li branch history 4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2) (01 = not taken then taken) CS252S05 L4 Pipe Issues 17 CS252S05 L4 Pipe Issues 18 1/27/2005 1/27/2005 What’s missing in this picture? NOW Handout Page 3
Recommend
More recommend