Pipelining Introduction

Pipelining Introduction - PowerPoint PPT Presentation



  1. 6a.1 EE 457 Unit 6a: Basic Pipelining Techniques

     6a.2 Pipelining Introduction
     • Consider a drink bottling plant
       – Filling the bottle = 3 sec.
       – Placing the cap = 3 sec.
       – Labeling = 3 sec.
     • Would you want…
       – Machine 1 = Does all three (9 secs.), outputs the bottle, repeats…
       – Machine 2 = Divided into three parts (one for each step), passing bottles between them
     • Machine _____ offers ability to __________
     [Figure: Machine 1 is a single Filler + Capper + Labeler unit (3 + 3 + 3 sec); Machine 2 is a Filler (3 sec), Place Cap (3 sec), and Labeler (3 sec) in series.]

     6a.3 More Pipelining Examples
     • Car Assembly Line
     • Wash/Dry/Fold
       – Would you buy a combo washer + dryer unit that does both operations in the same tank??
     • Freshman/Sophomore/Junior/Senior

     6a.4 Summing Elements
     • Consider adding an array of 4-bit numbers:
       – Z[i] = A[i] + B[i]
       – Delay: 10 ns Mem. Access (read or write), 10 ns each FA
       – Clock cycle time = _____________________________________
     [Figure: AMEM and BMEM, both addressed by i, supply A[3:0] and B[3:0] to a 4-bit ripple-carry adder built from four full adders (FA; inputs X, Y, Ci; outputs S, Co; carry-in 0). The sum Z[3:0] is written to ZMEM at address i. Figure label: 5 ns.]
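
The bottling-plant comparison on slide 6a.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `machine1_time` and `machine2_time` are made up here, and Machine 2 is assumed to hand bottles between stations with no transfer delay.

```python
# Stage times from the bottling-plant slide, in seconds: fill, cap, label.
STAGE_TIMES = [3, 3, 3]


def machine1_time(n_bottles, stages=STAGE_TIMES):
    """Machine 1: finishes every step on one bottle before starting the next."""
    return n_bottles * sum(stages)


def machine2_time(n_bottles, stages=STAGE_TIMES):
    """Machine 2: three stations work in parallel on different bottles.

    Assumption: bottles advance in lockstep at the pace of the slowest
    station, and passing a bottle along costs no extra time.
    """
    if n_bottles == 0:
        return 0
    step = max(stages)
    first_bottle = step * len(stages)      # the first bottle visits every station
    remaining = step * (n_bottles - 1)     # then one bottle finishes per step
    return first_bottle + remaining


if __name__ == "__main__":
    for n in (1, 3, 100):
        print(n, machine1_time(n), machine2_time(n))
```

For a single bottle the two machines take the same time; the difference only shows up once several bottles are in progress at once.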

  2. 6a.5 Pipelined Adder
     If we assume that the pipeline registers are ideal (0 ns additional delay) we can clock the pipe every __ ns. Speedup = ____
     [Figure: AMEM and BMEM (10 ns), both addressed by i, supply A[3:0] and B[3:0]. Pipeline registers (stage latches) S1/S2, S2/S3, S3/S4, S4/S5, and S5/S6 separate the stages. Each of the four adder stages holds one full adder (FA), producing C1/S0, C2/S1, C3/S2, and C4/S3 in turn, while unused operand bits and finished sum bits are carried along in the registers. The last stage writes Z[3:0] to ZMEM (10 ns) at address i.]

     6a.6 Pipelining Effects on Clock Period
     • Rather than just try to balance delay we could consider making more stages
       – Divide long stage into multiple stages
       – In Example 3, clock period could be _________________
       – Time through the pipeline (latency) is still 20 ns, but we've doubled our throughput (1 result every __ ns rather than every 10 or 15 ns)
       – Note: There is a small time overhead to adding a pipeline register/stage (i.e. can't go crazy adding stages)
     Ex. 1: Unbalanced stage delay (5 ns, 15 ns). Clock Period = ___ ns
     Ex. 2: Balanced stage delay (10 ns, 10 ns). Clock Period = __ ns (____ speedup)
     Ex. 3: Break long stage into multiple stages (5 ns, 5 ns, 5 ns, 5 ns). Clock period = ___ (___ speedup)

     6a.7 To Register or Latch?
     • What should we use for pipeline stages
       – Registers [edge-sensitive] …or…
       – Latches [level-sensitive]
     • Latches may allow data to _________________
     • Answer: __________________
     [Figure: stages S1, S2, S3 separated by elements labeled "Register or Latch".]

     6a.8 But Can We Latch?
     • We can latch if we run the latches on opposite phases of the clock or have a so-called _________________
       – Because each latch runs on the opposite phase, data can only move one step before being stopped by a latch that is in hold (off) mode
     • You may learn more about this in EE577a or EE560 (a technique known as Slack Borrowing & Time Stealing)
     [Figure: latches clocked alternately by Φ and ~Φ separate the half-stages S1b, S2a, S2b, S3a, S3b.]
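
The three stage-delay examples on slide 6a.6 can be explored with a small Python sketch. It rests on the standard rule that the slowest stage sets the clock period. The helper names, the optional `reg_overhead_ns` parameter, and the choice of Ex. 1 as the speedup baseline are assumptions made for illustration, not something the slides specify.

```python
def clock_period(stage_delays_ns, reg_overhead_ns=0):
    """Clock period = slowest stage, plus any per-stage register overhead.

    reg_overhead_ns defaults to 0, i.e. the "ideal register" assumption.
    """
    return max(stage_delays_ns) + reg_overhead_ns


def latency(stage_delays_ns, reg_overhead_ns=0):
    """Time for one item to pass through every stage of the clocked pipeline."""
    return len(stage_delays_ns) * clock_period(stage_delays_ns, reg_overhead_ns)


# Stage delays read off the three examples on the slide.
EXAMPLES = {
    "Ex. 1": [5, 15],
    "Ex. 2": [10, 10],
    "Ex. 3": [5, 5, 5, 5],
}

if __name__ == "__main__":
    baseline = clock_period(EXAMPLES["Ex. 1"])  # baseline choice is an assumption
    for name, delays in EXAMPLES.items():
        period = clock_period(delays)
        print(f"{name}: period = {period} ns, latency = {latency(delays)} ns, "
              f"throughput gain vs Ex. 1 = {baseline / period:.2f}x")
```

Passing a non-zero `reg_overhead_ns` shows why stages cannot be added indefinitely: the overhead is paid once per stage, so it takes up a growing share of an ever shorter clock period.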

  3. 6a.9 Pipelining Introduction
     • Implementation technique that _________ execution of multiple instructions at once
     • Improves ___________ rather than single-instruction execution latency
     • ______________ stage determines clock cycle time [e.g. a 30 min. wash cycle but 1 hour dry time means _________ per load]
     • In the case of perfectly balanced stages:
       – Speedup = ____
     • A 5-stage pipelined CPU may not realize this 5x speedup b/c…
       – The stages may not be perfectly balanced
       – The overhead of filling up the pipe initially
       – The overhead (setup time and clock-to-Q delay) of the stage registers
       – Inability to keep the pipe full due to branches & data hazards

     6a.10 Non-Pipelined Execution
     Instruction | Fetch (I-MEM) | Reg. Read | ALU Op. | Data Mem | Reg. Write | Total Time
     Load        | 10 ns         | 5 ns      | 10 ns   | 10 ns    | 5 ns       | 40 ns
     Store       |               |           |         |          |            |
     R-Type      |               |           |         |          |            |
     Branch      |               |           |         |          |            |
     Jump        |               |           |         |          |            |
     [Figure: timeline; LW $5,100($2), LW $7,40($6), and LW $8,24($6) each run Fetch, Reg, ALU, Data, Reg one after another, 40 ns per instruction.]
     3 Instructions = 3*40 ns

     6a.11 Pipelined Timing
     [Figure: time axis marked from 10 ns to 80 ns; LW $5,100($2), LW $7,40($6), LW $8,24($6), and the instructions that follow each pass through Fetch, Reg, ALU, Data, Reg in 10 ns slots.]
     • Notice that even though the register access only takes 5 ns it is allocated a 10 ns slot in the pipeline
     • Total time for these 3 pipelined instructions = 70 ns = ___ ns for 1st instruc. + _____ for the remaining instructions to complete
     • The speedup looks like it is only 120 ns / 70 ns = 1.7x
     • But consider 1003 instructions: ____________________________
       – The overhead of filling the pipeline is ___________ over steady-state execution when the pipeline is full

     6a.12 Pipelined Execution
     • Execute n instructions using a k stage datapath
       – i.e. Multicycle CPU w/ k steps or single cycle CPU w/ clock cycle k times slower
     • w/o pipelining: ___________
     • w/ pipelining: ____________
       – ___ cycles for 1st instruc. + ____ cycles for n-1 instrucs.
       – Assumes we keep the pipeline full

     Clock | Fetch | Decode | Exec. | Mem. | WB  | Phase
           | 10 ns | 10 ns  | 10 ns | 10 ns| 10 ns |
     C1    | ADD   |        |       |      |     | Pipeline Filling
     C2    | SUB   | ADD    |       |      |     | Pipeline Filling
     C3    | LW    | SUB    | ADD   |      |     | Pipeline Filling
     C4    | SW    | LW     | SUB   | ADD  |     | Pipeline Filling
     C5    | AND   | SW     | LW    | SUB  | ADD | Pipeline Full
     C6    | OR    | AND    | SW    | LW   | SUB | Pipeline Full
     C7    | XOR   | OR     | AND   | SW   | LW  | Pipeline Full
     C8    |       | XOR    | OR    | AND  | SW  | Pipeline Emptying
     C9    |       |        | XOR   | OR   | AND | Pipeline Emptying
     C10   |       |        |       | XOR  | OR  | Pipeline Emptying
     C11   |       |        |       |      | XOR | Pipeline Emptying
     7 Instrucs. = 11 clocks
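
The cycle counting on slides 6a.11 and 6a.12 can be checked with a short Python sketch. It assumes the pipeline stays full (no hazards or stalls), a 10 ns pipelined clock, and the 40 ns load time as the non-pipelined cost of every instruction. The function and constant names are illustrative only.

```python
STAGES = ["Fetch", "Decode", "Exec", "Mem", "WB"]
CLOCK_NS = 10         # pipelined clock period from the Pipelined Timing slide
SINGLE_CYCLE_NS = 40  # non-pipelined time per instruction (the LW row)


def pipelined_cycles(n, k=len(STAGES)):
    """k cycles for the first instruction, then one more per remaining instruction."""
    return 0 if n == 0 else k + (n - 1)


def speedup(n):
    """Non-pipelined time divided by pipelined time for n instructions."""
    return (n * SINGLE_CYCLE_NS) / (pipelined_cycles(n) * CLOCK_NS)


def occupancy(instrs, k=len(STAGES)):
    """One row per clock; row[s] is the instruction in stage s, or None if empty."""
    rows = []
    for cycle in range(pipelined_cycles(len(instrs), k)):
        row = []
        for s in range(k):
            idx = cycle - s  # instruction idx entered Fetch at clock idx
            row.append(instrs[idx] if 0 <= idx < len(instrs) else None)
        rows.append(row)
    return rows


if __name__ == "__main__":
    program = ["ADD", "SUB", "LW", "SW", "AND", "OR", "XOR"]
    for c, row in enumerate(occupancy(program), start=1):
        print(f"C{c:<3}", " ".join(f"{(i or '-'):<4}" for i in row))
    for n in (3, 1003):
        print(n, "instructions:", pipelined_cycles(n) * CLOCK_NS, "ns pipelined,",
              f"speedup {speedup(n):.2f}x")
```

With these numbers the speedup for large n approaches 40 ns / 10 ns = 4x rather than 5x, because the register stages need only 5 ns without pipelining but are given a full 10 ns slot in the pipeline.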

  4. 6a.13 Single-Cycle CPU Datapath
     [Figure: single-cycle datapath divided into Fetch (IF), Decode (ID), Exec. (EX), Mem, and WB. Components: PC, +4 adder, I-Cache, Register File (Read Reg. 1 #, Read Reg. 2 #, Write Reg. #, Write Data, Read data 1, Read data 2), Sign Extend (16 to 32), Sh. Left 2, branch adder, ALU (Zero, Res.), ALU control, D-Cache (Addr., Write Data, Read Data). Instruction fields: [25:21], [20:16], [15:11], [15:0], INST[5:0]. Control signals: PCSrc, Branch, RegWrite, RegDst, ALUSrc, ALUOp[1:0], MemRead, MemWrite, MemtoReg.]

     6a.14 Designing the Pipelined Datapath
     • To pipeline a datapath in five stages means five instructions will be executing ("in-flight") during any single clock cycle
     • Resources cannot be shared between stages because there may always be an instruction wanting to use the resource
       – Each stage needs its own resources
       – The single-cycle CPU datapath also matches this concept of no shared resources
       – We can simply divide the single-cycle CPU into stages

     6a.15 Information Flow in a Pipeline
     • Data or control information should flow only in the forward direction in a linear pipeline
       – Non-linear pipelines, where information is fed back into a previous stage, occur in more complex pipelines such as floating point dividers
     • The CPU pipeline is like a buffet line or cafeteria where people cannot revisit a previous serving station without disrupting the smooth flow of the line
     [Figure: Buffet Line.]

     6a.16 Register File
     • Don't we have a non-linear flow when we write a value back to the register file?
       – An instruction in WB is re-using the register file in the ID stage
       – Actually we are utilizing different ________ of the register file
         • ID stage ___________ register values
         • WB stage __________ register value
       – Like a buffet line with _________ at one station
     [Figure: pipeline drawn as IM, Reg, ALU, ???, Reg, shown next to a buffet line.]
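
The "no shared resources" rule on slide 6a.14 can be expressed as a simple check. The sketch below is a rough model only: the resource map is hypothetical, loosely named after blocks in the datapath figure, and it assumes the register file's read ports and write port count as separate resources.

```python
# Hypothetical map of which hardware blocks each stage uses in a given cycle.
STAGE_RESOURCES = {
    "IF":  {"I-Cache", "PC adder"},
    "ID":  {"Register File read ports"},
    "EX":  {"ALU", "branch adder"},
    "MEM": {"D-Cache"},
    "WB":  {"Register File write port"},  # assumption: separate from the read ports
}


def shared_resources(stage_resources):
    """Return {resource: [stages]} for every resource claimed by more than one stage."""
    users = {}
    for stage, resources in stage_resources.items():
        for resource in resources:
            users.setdefault(resource, []).append(stage)
    return {r: stages for r, stages in users.items() if len(stages) > 1}


if __name__ == "__main__":
    # Separate instruction and data memories: no stage competes with another.
    print(shared_resources(STAGE_RESOURCES))

    # A variant with one memory for instructions and data: IF and MEM collide.
    unified = dict(STAGE_RESOURCES, IF={"Memory", "PC adder"}, MEM={"Memory"})
    print(shared_resources(unified))
```

Because five instructions are in flight every cycle, any resource that appears in two stages would be requested by two instructions at once, which is why the check looks only at overlaps between stages.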

  5. 6a.17 Register File
     • Only an issue if WB to same register as being read
     • Register file can be designed to do "internal forwarding" where the data being written is immediately ______ out as …
     [Figure: pipeline diagram over clock cycles CC1 to CC8, each instruction passing through IM, Reg, ALU, DM, Reg. LW $5,100($2) writes $5 in its final Reg stage while ADD $3,$4,$5 reads $5 in its first Reg stage.]

     6a.18 Pipelining the Fetch Phase
     • Note that to keep the pipeline full we have to fetch a new instruction every clock cycle
     • Thus we have to perform the ___________________ PC = PC + 4 every clock cycle
     • Thus there shall be no pipelining registers in the datapath responsible for PC = PC + 4
     • Support for branch/jump warrants a lengthy discussion which we will perform later
     [Figure: Fetch stage with the PC, a +4 adder, a 0/1 mux, the I-Cache (Addr., Instruc.), and the Stage Register that follows it.]

     6a.19 Pipeline Packing List
     • Just as when you go on a trip you have to pack everything you need _____________, so in pipelines you have to take all the control and data you will need with you down the pipeline

     6a.20 Basic 5 Stage Pipeline
     • Compute the size of each pipeline register (find the max. info needed for any instruction in each stage)
     • To simplify, just consider LW/SW (Ignore control signals)
       LW $10,40($1):   Op = 35, rs = 1, rt = 10, immed. = 40
       SW $15,100($2):  Op = 43, rs = 2, rt = 15, immed. = 100
     [Figure: Fetch, Decode, Exec., Mem, WB datapath with a Pipeline Stage Register between each pair of stages. The first one is the Instruction Register (Instruc = 32). Components: PC, +4 adder, I-Cache, Register File (rs, rt, rt/rd), Sign Extend (16 to 32), Sh. Left 2, branch adder, ALU, D-Cache.]
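
The LW/SW field values listed on slide 6a.20 can be packed into 32-bit instruction words, which is what the first pipeline register (Instruc = 32) carries. The sketch below uses the standard MIPS I-type layout (op in bits [31:26], rs in [25:21], rt in [20:16], immediate in [15:0]). The helper names `encode_itype` and `decode_itype` are illustrative, and the decoder returns the raw 16-bit immediate without sign extension.

```python
def encode_itype(op, rs, rt, imm):
    """Pack MIPS I-type fields into one 32-bit instruction word."""
    return ((op & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


def decode_itype(word):
    """Split a 32-bit word back into its I-type fields (immediate left unextended)."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,
    }


if __name__ == "__main__":
    lw = encode_itype(op=35, rs=1, rt=10, imm=40)    # LW $10,40($1)
    sw = encode_itype(op=43, rs=2, rt=15, imm=100)   # SW $15,100($2)
    print(f"{lw:#010x}", decode_itype(lw))
    print(f"{sw:#010x}", decode_itype(sw))
```

Once an instruction moves past Decode, later stages no longer need the whole word, only the values derived from it. That is the sense in which each pipeline register has its own packing list.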
