6a.1
EE 457 Unit 6a
Basic Pipelining Techniques

6a.2 Pipelining Introduction
• Consider a drink bottling plant
  – Filling the bottle = 3 sec.
  – Placing the cap = 3 sec.
  – Labeling = 3 sec.
• Would you want…
  – Machine 1 = Does all three (9 secs.), outputs the bottle, repeats…
  – Machine 2 = Divided into three parts (one for each step) passing bottles between them
• Machine _____ offers ability to __________
[Figure: Machine 1 is a single Filler + Capper + Labeler unit (3 + 3 + 3); Machine 2 is a Filler (3 sec), Place Cap (3 sec) and Labeler (3 sec) in series]

6a.3 More Pipelining Examples
• Car Assembly Line
• Wash/Dry/Fold
  – Would you buy a combo washer + dryer unit that does both operations in the same tank??
• Freshman/Sophomore/Junior/Senior

6a.4 Summing Elements
• Consider adding an array of 4-bit numbers:
  – Z[i] = A[i] + B[i]
  – Delay: 10 ns Mem. Access (read or write), 10 ns each FA
  – Clock cycle time = _____________________________________
[Figure: AMEM and BMEM, both addressed by i, supply A[3:0] and B[3:0] to four chained full adders (FA) with the first carry-in tied to 0; the sum Z[3:0] is written to ZMEM at address i. A 5 ns label also appears in the figure.]
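The Summing Elements datapath above chains four full adders so that each carry-out feeds the next carry-in. A minimal Python sketch of that behaviour follows; the function names and the array contents are hypothetical stand-ins for the FA blocks and for AMEM/BMEM, and the sketch models only the logic, not the delays.

```python
def full_adder(x, y, ci):
    """One-bit full adder: returns (sum bit, carry-out bit)."""
    s = x ^ y ^ ci
    co = (x & y) | (x & ci) | (y & ci)
    return s, co


def ripple_add4(a, b):
    """Add two 4-bit values by chaining four full adders, LSB first.

    Returns (z, carry_out), where z is the 4-bit sum Z[3:0].
    """
    carry = 0  # carry-in of the least significant FA is tied to 0
    z = 0
    for bit in range(4):
        s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        z |= s << bit
    return z, carry


# Hypothetical array contents standing in for AMEM and BMEM.
A = [3, 7, 15, 9]
B = [4, 8, 1, 9]
Z = [ripple_add4(a, b)[0] for a, b in zip(A, B)]  # Z[i] = A[i] + B[i], 4 bits
print(Z)
```

Because Z is only 4 bits wide, sums above 15 wrap around; the carry out of the last FA is returned separately by `ripple_add4` for callers that need it.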
6a.5 Pipelined Adder
• If we assume that the pipeline registers are ideal (0 ns additional delay) we can clock the pipe every __ ns. Speedup = ____
[Figure: AMEM and BMEM (addressed by i) feed A[3:0] and B[3:0] into pipeline register S1/S2 (stage latch); each following stage holds one full adder, with pipeline registers S2/S3 (C1, S0), S3/S4 (C2, S1), S4/S5 (C3, S2) and S5/S6 (C4, S3) carrying the carry and the sum bits forward; Z[3:0] is written to ZMEM at address i. Memory stages are labeled 10 ns.]

6a.6 Pipelining Effects on Clock Period
• Rather than just try to balance delay we could consider making more stages
  – Divide long stage into multiple stages
  – In Example 3, clock period could be _________________
  – Time through the pipeline (latency) is still 20 ns, but we've doubled our throughput (1 result every __ ns rather than every 10 or 15 ns)
  – Note: There is a small time overhead to adding a pipeline register/stage (i.e. can't go crazy adding stages)
• Ex. 1: Unbalanced stage delay (5 ns, 15 ns). Clock Period = ___ ns
• Ex. 2: Balanced stage delay (10 ns, 10 ns). Clock Period = __ ns (____ speedup)
• Ex. 3: Break long stage into multiple stages (5 ns, 5 ns, 5 ns, 5 ns). Clock period = ___ (___ speedup)

6a.7 To Register or Latch?
• What should we use for pipeline stages
  – Registers [edge-sensitive] …or…
  – Latches [level-sensitive]
• Latches may allow data to _________________
• Answer: __________________
[Figure: stages S1, S2, S3 separated by "Register or Latch" blocks]

6a.8 But Can We Latch?
• We can latch if we run the latches on opposite phases of the clock or have a so-called _________________
  – Because each latch runs on the opposite phase, data can only move one step before being stopped by a latch that is in hold (off) mode
• You may learn more about this in EE577a or EE560 (a technique known as Slack Borrowing & Time Stealing)
[Figure: latches clocked alternately by Φ and ~Φ separate half-stages S1b, S2a, S2b, S3a, S3b]
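The three stage-delay examples in "Pipelining Effects on Clock Period" can be checked with a short Python sketch. It assumes the clock period is set by the slowest stage, and it treats the register overhead the slide mentions as a hypothetical `register_overhead_ns` parameter that defaults to the ideal 0 ns case; the function names are illustrative only.

```python
def clock_period(stage_delays_ns, register_overhead_ns=0):
    """Clock period is set by the slowest stage plus any register overhead."""
    return max(stage_delays_ns) + register_overhead_ns


def latency(stage_delays_ns, register_overhead_ns=0):
    """Time for one item to pass through every stage at that clock period."""
    return len(stage_delays_ns) * clock_period(stage_delays_ns, register_overhead_ns)


# Stage delays as labeled in the three examples.
examples = {
    "Ex. 1 (unbalanced)": [5, 15],
    "Ex. 2 (balanced)": [10, 10],
    "Ex. 3 (four stages)": [5, 5, 5, 5],
}
for name, delays in examples.items():
    print(f"{name}: period = {clock_period(delays)} ns, "
          f"latency = {latency(delays)} ns")
```

Passing a nonzero `register_overhead_ns` shows why stages cannot be added without limit: the overhead is paid once per stage, so latency grows with the stage count even as the period shrinks.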
6a.9 Pipelining Introduction
• Implementation technique that _________ execution of multiple instructions at once
• Improves ___________ rather than single-instruction execution latency
• ______________ stage determines clock cycle time [e.g. a 30 min. wash cycle but 1 hour dry time means _________ per load]
• In the case of perfectly balanced stages:
  – Speedup = ____
• A 5-stage pipelined CPU may not realize this 5x speedup b/c…
  – The stages may not be perfectly balanced
  – The overhead of filling up the pipe initially
  – The overhead (setup time and clock-to-Q delay) of the stage registers
  – Inability to keep the pipe full due to branches & data hazards

6a.10 Non-Pipelined Execution
Instruction   Fetch (I-MEM)   Reg. Read   ALU Op.   Data Mem   Reg. Write   Total Time
Load          10 ns           5 ns        10 ns     10 ns      5 ns         40 ns
Store
R-Type
Branch
Jump
[Figure: LW $5,100($2), LW $7,40($6) and LW $8,24($6) run back to back; each passes through Fetch, Reg, ALU, Data, Reg and takes 40 ns]
• 3 Instructions = 3*40 ns

6a.11 Pipelined Timing
[Figure: time axis from 10 ns to 80 ns; LW $5,100($2), LW $7,40($6) and LW $8,24($6) overlap, each starting one 10 ns slot after the previous one and passing through Fetch, Reg, ALU, Data, Reg]
• Notice that even though the register access only takes 5 ns it is allocated a 10 ns slot in the pipeline
• Total time for these 3 pipelined instructions = 70 ns = ___ ns for 1st instruc. + _____ for the remaining instructions to complete
• The speedup looks like it is only 120 ns / 70 ns = 1.7x
• But consider 1003 instructions: ____________________________
  – The overhead of filling the pipeline is ___________ over steady-state execution when the pipeline is full

6a.12 Pipelined Execution
• Execute n instructions using a k stage datapath
  – i.e. Multicycle CPU w/ k steps or single cycle CPU w/ clock cycle k times slower
• w/o pipelining: ___________
• w/ pipelining: ____________
  – ___ cycles for 1st instruc. + ____ cycles for n-1 instrucs.
  – Assumes we keep the pipeline full

       Fetch   Decode  Exec.   Mem.    WB
       10ns    10ns    10ns    10ns    10ns
C1     ADD     –       –       –       –       Pipeline Filling
C2     SUB     ADD     –       –       –
C3     LW      SUB     ADD     –       –
C4     SW      LW      SUB     ADD     –
C5     AND     SW      LW      SUB     ADD     Pipeline Full
C6     OR      AND     SW      LW      SUB
C7     XOR     OR      AND     SW      LW
C8     –       XOR     OR      AND     SW      Pipeline Emptying
C9     –       –       XOR     OR      AND
C10    –       –       –       XOR     OR
C11    –       –       –       –       XOR
• 7 Instrucs. = 11 clocks
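The cycle counts in "Pipelined Execution" and the 120 ns versus 70 ns comparison in "Pipelined Timing" can be reproduced with a small Python sketch. It assumes an ideal pipeline that never stalls; the function names are illustrative, and the 40 ns and 10 ns constants are the single-instruction time and the stage slot taken from the slides.

```python
def unpipelined_cycles(n, k):
    """Each of the n instructions takes all k steps by itself."""
    return n * k


def pipelined_cycles(n, k):
    """k cycles for the first instruction, then one more per remaining one."""
    return k + (n - 1)


def occupancy(instrs, k):
    """One row per clock; row[s] is the instruction in stage s, or None."""
    rows = []
    for clock in range(pipelined_cycles(len(instrs), k)):
        row = []
        for stage in range(k):
            idx = clock - stage  # instruction idx enters stage 0 at clock idx
            row.append(instrs[idx] if 0 <= idx < len(instrs) else None)
        rows.append(row)
    return rows


K = 5                 # Fetch, Decode, Exec., Mem., WB
STAGE_NS = 10         # slot given to every stage in the pipeline
SINGLE_CYCLE_NS = 40  # non-pipelined time for one load (10+5+10+10+5)

for n in (3, 1003):
    slow = n * SINGLE_CYCLE_NS
    fast = pipelined_cycles(n, K) * STAGE_NS
    print(f"n={n}: {slow} ns vs {fast} ns, speedup {slow / fast:.2f}x")

program = ["ADD", "SUB", "LW", "SW", "AND", "OR", "XOR"]
for c, row in enumerate(occupancy(program, K), start=1):
    print(f"C{c:<3}" + "".join(f"{(name or '-'):<5}" for name in row))
```

The printed table matches the C1 to C11 chart: the pipe fills for the first four clocks, is full while all five stages hold an instruction, and then drains.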
6a.13 Single-Cycle CPU Datapath
[Figure: single-cycle CPU datapath marked off into Fetch (IF), Decode (ID), Exec. (EX), Mem and WB. Blocks: PC with +4 adder and a branch-target adder (shift left 2), I-Cache, register file (read reg. 1 from [25:21], read reg. 2 from [20:16], write reg. from [20:16] or [15:11]), sign extend [15:0] from 16 to 32 bits, ALU with ALU control (INST[5:0], ALUOp[1:0]), D-Cache. Control signals: PCSrc, Branch, RegWrite, RegDst, ALUSrc, MemRead, MemWrite, MemtoReg.]

6a.14 Designing the Pipelined Datapath
• To pipeline a datapath in five stages means five instructions will be executing ("in-flight") during any single clock cycle
• Resources cannot be shared between stages because there may always be an instruction wanting to use the resource
  – Each stage needs its own resources
  – The single-cycle CPU datapath also matches this concept of no shared resources
  – We can simply divide the single-cycle CPU into stages

6a.15 Information Flow in a Pipeline
• Data or control information should flow only in the forward direction in a linear pipeline
  – Non-linear pipelines, where information is fed back into a previous stage, occur in more complex pipelines such as floating point dividers
• The CPU pipeline is like a buffet line or cafeteria where people cannot revisit a previous serving station without disrupting the smooth flow of the line
[Figure: Buffet Line]

6a.16 Register File
• Don't we have a non-linear flow when we write a value back to the register file?
  – An instruction in WB is re-using the register file in the ID stage
  – Actually we are utilizing different ________ of the register file
    • ID stage ___________ register values
    • WB stage __________ register value
  – Like a buffet line with _________ at one station
[Figure: pipeline diagrams (IM, Reg, ALU, …) with the register file access marked "???", shown next to a Buffet Line]
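The Register File question above, an instruction in WB using the register file while another instruction is in ID, can be explored with a small behavioural model. This is only a sketch under stated assumptions: the class, its `cycle` method and the `internal_forwarding` flag are hypothetical names, the write is assumed to commit at the end of the clock, and $0 is not hardwired to zero here. The register numbers follow the deck's LW $5,100($2) and ADD $3,$4,$5 instructions.

```python
class RegisterFile:
    """32-entry register file with a read side (ID) and a write side (WB)."""

    def __init__(self, internal_forwarding=True):
        self.regs = [0] * 32
        self.internal_forwarding = internal_forwarding

    def cycle(self, read1, read2, write_reg=None, write_data=0):
        """Model one clock: WB may write one register while ID reads two.

        Returns the two values delivered to the ID stage.
        """
        def read(r):
            # Assumed forwarding path: hand the incoming write data
            # straight to a reader of the same register.
            if (self.internal_forwarding and write_reg is not None
                    and r == write_reg):
                return write_data
            return self.regs[r]

        values = (read(read1), read(read2))
        if write_reg is not None:
            self.regs[write_reg] = write_data  # write commits at end of cycle
        return values


# LW $5,... is in WB (writing 99 to $5) while ADD $3,$4,$5 is in ID.
rf = RegisterFile(internal_forwarding=True)
rf.regs[4] = 7
print(rf.cycle(4, 5, write_reg=5, write_data=99))

stale = RegisterFile(internal_forwarding=False)
stale.regs[4] = 7
print(stale.cycle(4, 5, write_reg=5, write_data=99))
```

With the flag off, the reader of $5 sees the old contents, which is the only case where the shared register file causes trouble; reads of other registers are unaffected either way.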
6a.17 Register File
• Only an issue if WB to same register as being read
• Register file can be designed to do "internal forwarding" where the data being written is immediately ______ out as
[Figure: pipeline diagram over clock cycles CC1 to CC8, each instruction passing through IM, Reg, ALU, DM, Reg; LW $5,100($2) is marked "Write $5" and a later ADD $3,$4,$5 is marked "Read $5"]

6a.18 Pipelining the Fetch Phase
• Note that to keep the pipeline full we have to fetch a new instruction every clock cycle
• Thus we have to perform the ___________________ PC = PC + 4 every clock cycle
• Thus there shall be no pipelining registers in the datapath responsible for PC = PC + 4
• Support for branch/jump warrants a lengthy discussion which we will perform later
[Figure: Fetch stage with the PC, a +4 adder, the I-Cache and the stage register that receives the instruction]

6a.19 Pipeline Packing List
• Just as when you go on a trip you have to pack everything you need _____________, so in pipelines you have to take all the control and data you will need with you down the pipeline

6a.20 Basic 5 Stage Pipeline
• Compute the size of each pipeline register (find the max. info needed for any instruction in each stage)
• To simplify, just consider LW/SW (ignore control signals)
  – LW $10,40($1): Op = 35, rs = 1, rt = 10, immed. = 40
  – SW $15,100($2): Op = 43, rs = 2, rt = 15, immed. = 100
[Figure: five-stage datapath (Fetch, Decode, Exec., Mem, WB) with a pipeline stage register between each pair of stages; the first one is the Instruction Register (Instruc = 32 bits). Blocks: PC with +4 adder, I-Cache, register file (rs, rt, rt/rd), sign extend 16 to 32, shift left 2 and branch adder, ALU, D-Cache.]
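The Op, rs, rt and immed. values listed for the two example instructions are the fields of a 32-bit MIPS I-type word, which is what the first pipeline register has to carry. A short Python sketch of the packing and unpacking follows; the helper names are illustrative, and the bit positions are the standard MIPS I-type layout.

```python
def encode_itype(op, rs, rt, immed):
    """Pack I-type fields: op[31:26], rs[25:21], rt[20:16], immed[15:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (immed & 0xFFFF)


def decode_itype(word):
    """Split a 32-bit instruction word back into (op, rs, rt, immed)."""
    immed = word & 0xFFFF
    if immed & 0x8000:  # sign-extend the 16-bit immediate
        immed -= 0x10000
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, immed


lw = encode_itype(35, 1, 10, 40)    # LW $10,40($1)
sw = encode_itype(43, 2, 15, 100)   # SW $15,100($2)
print(f"{lw:#010x} {sw:#010x}")
print(decode_itype(lw), decode_itype(sw))
```

Decoding recovers exactly the fields from the slide, which is the information later stages have to carry forward in whatever form they still need it.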