CENG 3420 Lecture 06: Pipeline (Bei Yu, byu@cse.cuhk.edu.hk)

  1. CENG 3420 Lecture 06: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L06.1 Spring 2020

  2. Outline
     - Pipeline Motivations
     - Pipeline Hazards
     - Exceptions
     - Background: Flip-Flop Control Signals

  4. Review: Instruction Critical Paths
     - Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
       - Instruction and Data Memory (4 ns)
       - ALU and adders (2 ns)
       - Register File access (reads or writes) (1 ns)

     All delays in ns; "-" means the unit is not on that instruction's critical path.

     Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total
     R-type   4       1        2        -       1        8
     load     4       1        2        4       1        12
     store    4       1        2        4       -        11
     beq      4       1        2        -       -        7
     jump     4       -        -        -       -        4
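The totals in the table above can be re-derived by summing the delays of the units each instruction uses. The following is a minimal sketch, assuming the unit delays from the slide; the names `DELAY_NS`, `PATHS`, `path_delay_ns`, and `single_cycle_clock_ns` are illustrative and not part of the lecture.

```python
# Unit delays in ns, as listed on the slide (all other delays assumed negligible).
DELAY_NS = {"imem": 4, "reg_read": 1, "alu": 2, "dmem": 4, "reg_write": 1}

# Units on each instruction's critical path, following the slide's table.
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}


def path_delay_ns(instr):
    """Sum the delays of every unit on the instruction's critical path."""
    return sum(DELAY_NS[unit] for unit in PATHS[instr])


def single_cycle_clock_ns():
    """A single-cycle clock must fit the slowest instruction."""
    return max(path_delay_ns(instr) for instr in PATHS)


if __name__ == "__main__":
    for instr in PATHS:
        print(f"{instr:7s} {path_delay_ns(instr):2d} ns")
    print("single-cycle clock:", single_cycle_clock_ns(), "ns")
```

Under these assumptions the load sets the single-cycle clock at 12 ns, even though a jump needs only 4 ns.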

  5. Review: Single Cycle Disadvantages & Advantages
     - Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
       - especially problematic for more complex instructions like floating point multiply
     - May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
     - but it is simple and easy to understand
     [Figure: clock timing over Cycle 1 and Cycle 2, with lw in Cycle 1, sw in Cycle 2, and the unused part of the cycle labeled "Waste"]

  6. How Can We Make It Faster?
     - Start fetching and executing the next instruction before the current one has completed
       - Pipelining: (all?) modern processors are pipelined for performance
       - Remember the performance equation: CPU time = CPI * CC * IC
     - Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
       - A five-stage pipeline is nearly five times faster because the CC is "nearly" five times shorter
     - Fetch (and execute) more than one instruction at a time
       - Superscalar processing (stay tuned)

  7. The Five Stages of Load Instruction
     [Figure: lw passing through IFetch, Dec, Exec, Mem, WB in Cycle 1 to Cycle 5]
     - IFetch: Instruction Fetch and Update PC
     - Dec: Registers Fetch and Instruction Decode
     - Exec: Execute R-type; calculate memory address
     - Mem: Read/write the data from/to the Data Memory
     - WB: Write the result data into the register file

  8. A Pipelined MIPS Processor
     - Start the next instruction before the current one has completed
       - improves throughput: the total amount of work done in a given time
       - instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced
     [Figure: lw, sw, and an R-type instruction on a Cycle 1 to Cycle 8 axis, each passing through IFetch, Dec, Exec, Mem, WB and starting one cycle after the previous one]
     - the clock cycle (pipeline stage time) is limited by the slowest stage
     - some stages do not need the whole clock cycle (e.g., WB)
     - for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

  9. Single Cycle versus Pipeline
     - Single Cycle Implementation (CC = 800 ps):
       [Figure: lw in Cycle 1 and sw in Cycle 2, with labels "Waste" and "400 ps"]
     - Pipeline Implementation (CC = 200 ps):
       [Figure: lw, sw, and an R-type instruction, each passing through IFetch, Dec, Exec, Mem, WB, staggered by one cycle]
     - To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
     - How long does each take to complete 1,000,000 adds?
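The two questions on the slide above can be worked through numerically. The sketch below assumes an ideal five-stage pipeline with no stalls (the adds are taken to be independent), so n instructions need (stages + n - 1) cycles; the function names are illustrative only.

```python
SINGLE_CYCLE_CC_PS = 800   # clock cycle of the single-cycle design (from the slide)
PIPELINE_CC_PS = 200       # clock cycle of the pipelined design (from the slide)
STAGES = 5                 # IFetch, Dec, Exec, Mem, WB


def single_cycle_time_ps(n):
    """Every instruction takes exactly one (long) cycle."""
    return n * SINGLE_CYCLE_CC_PS


def pipelined_time_ps(n):
    """Ideal pipeline: STAGES cycles for the first instruction, then one more per instruction."""
    return (STAGES + n - 1) * PIPELINE_CC_PS


if __name__ == "__main__":
    # Latency of one instruction: 5 stages, each padded to the 200 ps clock.
    print("one instruction, pipelined:", pipelined_time_ps(1), "ps")
    n = 1_000_000
    t_single = single_cycle_time_ps(n)
    t_pipe = pipelined_time_ps(n)
    print("single cycle:", t_single, "ps")
    print("pipelined:   ", t_pipe, "ps")
    print("speedup:     ", t_single / t_pipe)
```

Under these assumptions a single instruction takes 5 x 200 ps = 1000 ps, because every stage is stretched to the clock of the slowest stage. One million adds take 800,000,000 ps single-cycle versus 200,000,800 ps pipelined, a speedup just under 4 (the ratio of the two clock cycles) rather than 5.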

  10. Pipelining the MIPS ISA
      - What makes it easy
        - all instructions are the same length (32 bits)
          - can fetch in the 1st stage and decode in the 2nd stage
        - few instruction formats (three) with symmetry across formats
          - can begin reading the register file in the 2nd stage
        - memory operations occur only in loads and stores
          - can use the execute stage to calculate memory addresses
        - each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
        - operands must be aligned in memory, so a single data transfer takes only one data memory access

  11. MIPS Pipeline Datapath Additions/Mods
      - State registers between each pipeline stage to isolate them
      [Figure: pipelined datapath with stages IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), WB (WriteBack), separated by the state registers IF/ID, ID/EX, EX/MEM, MEM/WB; components shown are PC, Instruction Memory, Register File, Sign Extend (16 to 32), Shift left 2, two adders, ALU, Data Memory, and the System Clock]

  12. MIPS Pipeline Control Path Modifications
      - All control signals can be determined during Decode
        - and held in the state registers between pipeline stages
      [Figure: the pipelined datapath with a Control unit added in the Decode stage; labeled control signals are PCSrc, Branch, RegWrite, MemtoReg, ALUSrc, MemRead, ALUOp, RegDst, and ALU cntrl, carried through ID/EX, EX/MEM, and MEM/WB]

  13. Pipeline Control
      - IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
      - ID Stage: no optional control signals to set

      Control signals per stage (X = don't care):

              | EX Stage                       | MEM Stage                  | WB Stage
      Instr.  | RegDst  ALUOp1  ALUOp0  ALUSrc | Brch  MemRead  MemWrite    | RegWrite  MemtoReg
      R       | 1       1       0       0      | 0     0        0           | 1         0
      lw      | 0       0       0       1      | 0     1        0           | 1         1
      sw      | X       0       0       1      | 0     0        1           | 0         X
      beq     | X       0       1       0      | 1     0        0           | 0         X
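One way to picture how control signals are set in Decode and then carried along is to treat them as a bundle that each stage strips its own group from. The sketch below encodes the slide's table as a Python dictionary; `CONTROL`, `decode`, and `consume` are hypothetical names for illustration, and don't-care entries (X) are represented as `None`.

```python
X = None  # don't care

# Control values per instruction class, grouped by the stage that uses them
# (values copied from the slide's table).
CONTROL = {
    "R":   {"EX":  {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
            "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
            "WB":  {"RegWrite": 1, "MemtoReg": 0}},
    "lw":  {"EX":  {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
            "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
            "WB":  {"RegWrite": 1, "MemtoReg": 1}},
    "sw":  {"EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
            "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
            "WB":  {"RegWrite": 0, "MemtoReg": X}},
    "beq": {"EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
            "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
            "WB":  {"RegWrite": 0, "MemtoReg": X}},
}


def decode(op):
    """Produce the full control bundle in Decode (what ID/EX would hold)."""
    return {stage: dict(signals) for stage, signals in CONTROL[op].items()}


def consume(bundle, stage):
    """Use one stage's signals and pass the rest on to the next pipeline register."""
    remaining = {s: sig for s, sig in bundle.items() if s != stage}
    return bundle[stage], remaining


if __name__ == "__main__":
    id_ex = decode("lw")
    ex_signals, ex_mem = consume(id_ex, "EX")
    mem_signals, mem_wb = consume(ex_mem, "MEM")
    print("EX uses: ", ex_signals)
    print("MEM uses:", mem_signals)
    print("MEM/WB still holds:", mem_wb)
```

The bundle shrinks as the instruction advances, which mirrors why the later pipeline registers need fewer control bits than ID/EX.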

  14. Graphically Representing MIPS Pipeline
      [Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
      - Can help with answering questions like:
        - How many cycles does it take to execute this code?
        - What is the ALU doing during cycle 4?
        - Is there a hazard, why does it occur, and how can it be fixed?
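The first two questions above can also be answered with a small text version of the diagram. This is a minimal sketch assuming one instruction issues per cycle with no stalls; the three-instruction program and the helper names (`schedule`, `total_cycles`, `in_stage`, `render`) are made up for illustration.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # stage labels used in the slide's drawing


def schedule(instrs):
    """For each instruction, map cycle number -> stage label (no stalls assumed)."""
    return [(name, {start + s + 1: stage for s, stage in enumerate(STAGES)})
            for start, name in enumerate(instrs)]


def total_cycles(instrs):
    """Cycles until the last instruction leaves the pipeline."""
    return len(STAGES) + len(instrs) - 1 if instrs else 0


def in_stage(instrs, stage_index, cycle):
    """Instruction occupying STAGES[stage_index] in the given cycle, or None."""
    i = cycle - 1 - stage_index
    return instrs[i] if 0 <= i < len(instrs) else None


def render(instrs):
    """Return the pipeline diagram as text, one row per instruction."""
    n = total_cycles(instrs)
    rows = []
    for name, occupancy in schedule(instrs):
        cells = [occupancy.get(c, "").ljust(5) for c in range(1, n + 1)]
        rows.append(name.ljust(6) + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    program = ["lw", "sw", "add"]  # hypothetical example program
    print(render(program))
    print("cycles needed:", total_cycles(program))
    print("ALU in cycle 4 is working on:", in_stage(program, 2, 4))
```

For this example program the run takes 7 cycles, and in cycle 4 the ALU is working on the second instruction (`sw`).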

  15. Other Pipeline Structures Are Possible
      - What about the (slow) multiply operation?
        - make the clock twice as slow or ...
        - let it take two cycles (since it doesn't use the DM stage)
        [Figure: pipeline IM, Reg, ALU, DM, Reg with a MUL unit]
      - What if the data memory access is twice as slow as the instruction memory?
        - make the clock twice as slow or ...
        - let data memory access take two cycles (and keep the same clock rate)
        [Figure: pipeline IM, Reg, ALU, DM1, DM2, Reg]

  16. Other Sample Pipeline Alternatives
      - ARM7 (stages IM, Reg, EX)
        - IM: PC update, IM access
        - Reg: decode, reg access
        - EX: ALU op, DM access, shift/rotate, commit result (write back)
      - XScale (stages IM1, IM2, Reg, SHFT, ALU, DM1, DM2)
        - IM1: PC update, BTB access, start IM access
        - IM2: IM access
        - Reg: decode, reg 1 access
        - SHFT: shift/rotate, reg 2 access
        - ALU: ALU op
        - DM1: start DM access, exception
        - DM2: DM write, reg write

  17. Why Pipeline? For Performance!
      - Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
      - The first cycles are the time to fill the pipeline
      [Figure: Inst 0 through Inst 4 in instruction order, each passing through IM, Reg, ALU, DM, Reg and staggered by one cycle along the time (clock cycles) axis]

  18. Outline
      - Pipeline Motivations
      - Pipeline Hazards
      - Exceptions
      - Background: Flip-Flop Control Signals

  19. Can Pipelining Get Us Into Trouble?
      - Yes: Pipeline Hazards
        - structural hazards: a required resource is busy
        - data hazards: attempt to use data before it is ready
        - control hazards: deciding on control action depends on previous instruction
      - Can usually resolve hazards by waiting
        - pipeline control must detect the hazard
        - and take action to resolve hazards

  20. Structural Hazards
      - Conflict for use of a resource
      - In MIPS pipeline with a single memory
        - Load/store requires data access
        - Instruction fetch requires instruction access
      - Hence, pipeline datapaths require separate instruction/data memories
        - Or separate instruction/data caches
      - Since Register File

  21. Resolve Structural Hazard 1
      [Figure: lw, Inst 1, Inst 2, Inst 3, Inst 4 in instruction order over time (clock cycles), each using Mem, Reg, ALU, Mem, Reg; lw is reading data from memory in the same cycle in which a later instruction is reading its instruction from memory]
      - Fix with separate instr and data memories (I$ and D$)
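The conflict pictured above can be located by checking, for each load or store, which instruction is being fetched in the cycle of its memory access. The sketch assumes a five-stage pipeline with one issue per cycle and a single shared memory; `memory_conflicts` and the placeholder opcodes are illustrative, not from the lecture.

```python
IF_STAGE = 0   # instruction fetch is the 1st stage
MEM_STAGE = 3  # data memory access is the 4th stage


def memory_conflicts(instrs):
    """Find cycles in which a single shared memory is needed twice.

    instrs: opcode strings, issued one per cycle starting at cycle 1.
    Returns (cycle, index of load/store, index of instruction being fetched).
    """
    conflicts = []
    for i, op in enumerate(instrs):
        if op not in ("lw", "sw"):
            continue  # only loads and stores access data memory
        mem_cycle = i + 1 + MEM_STAGE
        fetch_index = mem_cycle - 1 - IF_STAGE  # instruction fetched in that cycle
        if fetch_index < len(instrs):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


if __name__ == "__main__":
    # lw followed by four placeholder instructions, as on the slide.
    program = ["lw", "inst1", "inst2", "inst3", "inst4"]
    for cycle, mem_i, fetch_i in memory_conflicts(program):
        print(f"cycle {cycle}: {program[mem_i]} data access collides "
              f"with fetch of {program[fetch_i]}")
```

With separate instruction and data memories the two accesses go to different resources, so the list of conflicts is no longer a problem.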

  22. Resolve Structural Hazard 2
      - Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
      [Figure: add $1, ... followed by Inst 1, Inst 2, and add $2,$1, ... in instruction order over time (clock cycles), each using IM, Reg, ALU, DM, Reg; one clock edge is labeled as controlling register writing, the other as controlling loading of the pipeline state registers]

  23. Data Hazards: Register Usage
      - Dependencies backward in time cause hazards
        add $1,
        sub $4,$1,$5
        and $6,$1,$7
        or  $8,$1,$9
        xor $4,$1,$5
      [Figure: the five instructions above, each passing through IM, Reg, ALU, DM, Reg, staggered by one cycle]
      - Read before write data hazard

  24. Data Hazards: Load Memory
      - Dependencies backward in time cause hazards
        lw  $1,4($2)
        sub $4,$1,$5
        and $6,$1,$7
        or  $8,$1,$9
        xor $4,$1,$5
      [Figure: the five instructions above in instruction order, each passing through IM, Reg, ALU, DM, Reg, staggered by one cycle]
      - Load-use data hazard
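The backward dependencies in the two instruction sequences above can be found mechanically. The sketch below assumes a five-stage pipeline with no forwarding (forwarding is not covered on these slides) and a register file that writes in the first half of a cycle and reads in the second half, so a result written in WB is visible to an instruction three positions later. The `raw_hazards` helper and the tuple encoding of instructions are illustrative; the source operands of the first `add` are elided on the slide and are left empty here.

```python
def raw_hazards(program):
    """Find read-before-write hazards in a 5-stage pipeline without forwarding.

    program: list of (text, dest_register_or_None, source_registers).
    An instruction at index i writes in WB (cycle i + 5); one at index j reads
    in ID (cycle j + 2). The read is too early when j - i is 1 or 2.
    Returns (producer_index, consumer_index, register) tuples.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 3, len(program))):
            _, later_dest, sources = program[j]
            if dest in sources:
                hazards.append((i, j, dest))
            if later_dest == dest:
                break  # register redefined; later readers depend on instruction j
    return hazards


# Sequence from the register-usage slide (add's sources are elided on the slide).
REGISTER_SEQ = [
    ("add $1,",       "$1", ()),
    ("sub $4,$1,$5",  "$4", ("$1", "$5")),
    ("and $6,$1,$7",  "$6", ("$1", "$7")),
    ("or  $8,$1,$9",  "$8", ("$1", "$9")),
    ("xor $4,$1,$5",  "$4", ("$1", "$5")),
]

# Sequence from the load slide.
LOAD_SEQ = [("lw $1,4($2)", "$1", ("$2",))] + REGISTER_SEQ[1:]

if __name__ == "__main__":
    for name, seq in (("register usage", REGISTER_SEQ), ("load", LOAD_SEQ)):
        for i, j, reg in raw_hazards(seq):
            print(f"{name}: '{seq[j][0]}' reads {reg} before '{seq[i][0]}' writes it")
```

Under these assumptions `sub` and `and` read `$1` too early in both sequences, while `or` is saved by the split-cycle register file and `xor` reads after the write has completed.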
