CENG 3420 Lecture 07: Pipeline (Bei Yu, byu@cse.cuhk.edu.hk)

  1. CENG 3420 Lecture 07: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L07.1 Spring 2018

  2. Outline
     ■ Review: Flip-Flop Control Signals
     ■ Pipeline Motivations
     ■ Pipeline Hazards
     ■ Exceptions

  3. Outline
     ■ Review: Flip-Flop Control Signals
     ■ Pipeline Motivations
     ■ Pipeline Hazards
     ■ Exceptions

  4. Clocking Methodologies
     ■ A clocking methodology defines when signals can be read and when they can be written
       [Figure: clock waveform marking the rising (positive) edge, the falling (negative) edge, and one clock cycle]
       clock rate = 1/(clock cycle)
       e.g., 10 nsec clock cycle = 100 MHz clock rate
              1 nsec clock cycle = 1 GHz clock rate
     ■ State element design choices
       ● level-sensitive latch
       ● master-slave and edge-triggered flip-flops
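
A minimal sketch of the clock rate relation on this slide, reproducing the two worked values. The function name `clock_rate_hz` is an illustrative assumption, not something from the lecture.

```python
def clock_rate_hz(cycle_time_s):
    """Clock rate in Hz for a given clock cycle time in seconds."""
    return 1.0 / cycle_time_s

# The two examples from the slide.
rate_10ns = clock_rate_hz(10e-9)
rate_1ns = clock_rate_hz(1e-9)

print(f"10 ns cycle -> {rate_10ns / 1e6:.0f} MHz")
print(f"1 ns cycle  -> {rate_1ns / 1e9:.0f} GHz")
```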

  5. Review: Latches vs Flip-Flops
     ■ Output is equal to the stored value inside the element
     ■ Change of state (value) is based on the clock
       ● Latches: output changes whenever the inputs change and the clock is asserted (level-sensitive methodology)
         - Two-sided timing constraint
       ● Flip-flops: output changes only on a clock edge (edge-triggered methodology)
         - One-sided timing constraint
     A clocking methodology defines when signals can be read and written; we would NOT want to read a signal at the same time it is being written
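
The difference between the two elements can be sketched with a small behavioral model. This is an illustration only: it assumes an active-high latch and a positive-edge-triggered flip-flop, samples the clock at discrete steps, and ignores all timing constraints. The function name `simulate` and the sample waveform are hypothetical.

```python
def simulate(samples):
    """Run a D latch and a D flip-flop over (clk, d) samples.

    Returns a list of (latch_q, ff_q) pairs, one per sample.
    """
    latch_q = 0
    ff_q = 0
    prev_clk = 0
    trace = []
    for clk, d in samples:
        if clk == 1:
            # Level sensitive: the latch follows d while the clock is high.
            latch_q = d
        if prev_clk == 0 and clk == 1:
            # Edge triggered: the flip-flop samples d only on a rising edge.
            ff_q = d
        prev_clk = clk
        trace.append((latch_q, ff_q))
    return trace

# d changes while the clock is still high in the 3rd and 6th samples.
samples = [(0, 1), (1, 1), (1, 0), (0, 1), (1, 1), (1, 0)]
trace = simulate(samples)
for (clk, d), (latch_q, ff_q) in zip(samples, trace):
    print(f"clk={clk} d={d} latch={latch_q} flipflop={ff_q}")
```

In the third and sixth samples the latch output follows the new input because the clock is still asserted, while the flip-flop keeps the value it captured at the rising edge.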

  6. Review: Design A Latch
     ■ Store one bit of information: cross-coupled inverters
       [Figure: cross-coupled inverter pair and its equivalent symbol]
     ■ How to change the value stored?
       ● R: reset signal
       ● S: set signal
       [Figure: SR-Latch; other latch structures]

  7. Review: Design A Flip-Flop
     ■ Based on Gated Latch
       [Figure: gated latch and its equivalent symbol]
     ■ Master-slave positive-edge-triggered D flip-flop

  8. Review: Latch and Flip-Flop
     ■ Latch is level-sensitive
     ■ Flip-flop is edge-triggered

  9. Our Implementation
     ■ An edge-triggered methodology
     ■ Typical execution
       ● read contents of some state elements
       ● send values through some combinational logic
       ● write results to one or more state elements
       [Figure: State element 1 -> Combinational logic -> State element 2, both clocked, spanning one clock cycle]
     ■ Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
       ● write occurs only when both the write control is asserted and the clock edge occurs
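
The read, compute, write pattern and the explicit write control can be sketched as follows. The class `StateElement` and the helper `step` are hypothetical names used only for illustration; each call to `step` stands for one clock edge.

```python
class StateElement:
    """An edge-triggered state element with an explicit write control."""

    def __init__(self, value=0):
        self.value = value

    def clock_edge(self, data_in, write_enable=True):
        # The write happens only when the clock edge occurs (this call)
        # AND the write control is asserted.
        if write_enable:
            self.value = data_in


def step(src, dst, logic, write_enable=True):
    """One clock cycle: read src, apply combinational logic, write dst."""
    result = logic(src.value)
    dst.clock_edge(result, write_enable)


src = StateElement(3)
dst = StateElement(0)

step(src, dst, lambda x: x + 1)          # dst is written with 3 + 1
after_first = dst.value

src.value = 10
step(src, dst, lambda x: x + 1, write_enable=False)  # write suppressed
after_second = dst.value

print(after_first, after_second)
```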

  10. Outline
      ■ Review: Flip-Flop Control Signals
      ■ Pipeline Motivations
      ■ Pipeline Hazards
      ■ Exceptions

  11. Review: Instruction Critical Paths
      ■ Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
        ● Instruction and Data Memory (4 ns)
        ● ALU and adders (2 ns)
        ● Register File access (reads or writes) (1 ns)

        Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total (ns)
        R-type   4       1        2                1        8
        load     4       1        2        4       1        12
        store    4       1        2        4                11
        beq      4       1        2                         7
        jump     4                                          4
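
The totals in the table can be reproduced by summing the delays of the units each instruction uses. This is a sketch under the slide's assumptions (all other delays negligible); the dictionary names and unit keys are illustrative.

```python
# Unit delays in ns, as given on the slide.
DELAY_NS = {"imem": 4, "reg_rd": 1, "alu": 2, "dmem": 4, "reg_wr": 1}

# Units on the critical path of each instruction class, per the table.
UNITS_USED = {
    "R-type": ["imem", "reg_rd", "alu", "reg_wr"],
    "load":   ["imem", "reg_rd", "alu", "dmem", "reg_wr"],
    "store":  ["imem", "reg_rd", "alu", "dmem"],
    "beq":    ["imem", "reg_rd", "alu"],
    "jump":   ["imem"],
}

totals = {
    instr: sum(DELAY_NS[unit] for unit in units)
    for instr, units in UNITS_USED.items()
}

# A single-cycle design must fit the slowest instruction in one cycle.
single_cycle_time_ns = max(totals.values())

for instr, total in totals.items():
    print(f"{instr:7s} {total:2d} ns")
print("single-cycle clock:", single_cycle_time_ns, "ns")
```

The load instruction sets the single-cycle clock period, which is the inefficiency the next slide discusses.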

  12. Review: Single Cycle Disadvantages & Advantages
      ■ Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
        ● especially problematic for more complex instructions like floating point multiply
        [Figure: timing diagram over Cycle 1 and Cycle 2; lw fills its cycle, sw finishes early and the rest of its cycle is marked "Waste"]
      ■ May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
      but
      ■ It is simple and easy to understand

  13. How Can We Make It Faster?
      ■ Start fetching and executing the next instruction before the current one has completed
        ● Pipelining: (all?) modern processors are pipelined for performance
        ● Remember the performance equation: CPU time = CPI * CC * IC
      ■ Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
        ● A five-stage pipeline is nearly five times faster because the CC is “nearly” five times faster
      ■ Fetch (and execute) more than one instruction at a time
        ● Superscalar processing (stay tuned)
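
A rough sketch of why the speedup approaches the number of stages. It assumes perfectly balanced stages, no hazards or stalls, and an unpipelined design whose cycle equals the sum of all stage times; the function name `ideal_speedup` is hypothetical.

```python
def ideal_speedup(num_instr, num_stages):
    """Speedup of an ideal pipeline over an unpipelined design.

    Time is measured in units of one pipeline stage.
    """
    # Unpipelined: every instruction occupies all stages back to back.
    unpipelined = num_instr * num_stages
    # Pipelined: num_stages cycles for the first instruction,
    # then one more cycle for each remaining instruction.
    pipelined = num_stages + (num_instr - 1)
    return unpipelined / pipelined


for n in (1, 10, 1000, 1_000_000):
    print(f"{n:>9} instructions: speedup {ideal_speedup(n, 5):.4f}")
```

With a single instruction there is no gain at all; the speedup only approaches 5 as the instruction count grows and the fill time becomes negligible.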

  14. The Five Stages of Load Instruction

              Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
        lw    IFetch   Dec      Exec     Mem      WB

      ■ IFetch: Instruction Fetch and Update PC
      ■ Dec: Registers Fetch and Instruction Decode
      ■ Exec: Execute R-type; calculate memory address
      ■ Mem: Read/write the data from/to the Data Memory
      ■ WB: Write the result data into the register file

  15. A Pipelined MIPS Processor
      ■ Start the next instruction before the current one has completed
        ● improves throughput: total amount of work done in a given time
        ● instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced

                  Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
        lw        IFetch   Dec      Exec     Mem      WB
        sw                 IFetch   Dec      Exec     Mem      WB
        R-type                      IFetch   Dec      Exec     Mem      WB

        - clock cycle (pipeline stage time) is limited by the slowest stage
        - some stages don’t need the whole clock cycle (e.g., WB)
        - for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

  16. Single Cycle versus Pipeline
      Single Cycle Implementation (CC = 800 ps):
        [Figure: timing diagram over Cycle 1 and Cycle 2; lw, then sw, with the unused part of the sw cycle marked "Waste" (400 ps)]
      Pipeline Implementation (CC = 200 ps):

        lw        IFetch   Dec      Exec     Mem      WB
        sw                 IFetch   Dec      Exec     Mem      WB
        R-type                      IFetch   Dec      Exec     Mem      WB

      ■ To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
      ■ How long does each take to complete 1,000,000 adds?
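
One way to work through the two questions on this slide, using only the cycle times given above. The sketch assumes the adds are independent (no hazards or stalls) and that the pipeline needs four extra cycles to fill; the variable names are illustrative.

```python
SINGLE_CYCLE_CC_PS = 800   # from the slide
PIPELINE_CC_PS = 200       # from the slide
NUM_STAGES = 5
NUM_ADDS = 1_000_000

# Latency of one instruction in the pipeline: it passes through all
# five stages, and every stage takes one full (slowest-stage) cycle.
pipelined_latency_ps = NUM_STAGES * PIPELINE_CC_PS

# Single cycle: one instruction per 800 ps cycle.
single_cycle_total_ps = NUM_ADDS * SINGLE_CYCLE_CC_PS

# Pipeline: 5 cycles for the first add, then one add completes per cycle.
pipelined_total_ps = (NUM_STAGES + NUM_ADDS - 1) * PIPELINE_CC_PS

speedup = single_cycle_total_ps / pipelined_total_ps

print("pipelined latency:", pipelined_latency_ps, "ps")
print("single cycle total:", single_cycle_total_ps / 1e6, "us")
print("pipelined total:   ", pipelined_total_ps / 1e6, "us")
print(f"speedup: {speedup:.5f}")
```

Under these assumptions the speedup is close to 4 rather than 5, because the pipelined clock is limited by the slowest stage (800 / 200 = 4).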

  17. Pipelining the MIPS ISA
      ■ What makes it easy
        ● all instructions are the same length (32 bits)
          - can fetch in the 1st stage and decode in the 2nd stage
        ● few instruction formats (three) with symmetry across formats
          - can begin reading register file in 2nd stage
        ● memory operations occur only in loads and stores
          - can use the execute stage to calculate memory addresses
        ● each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
        ● operands must be aligned in memory so a single data transfer takes only one data memory access

  18. MIPS Pipeline Datapath Additions/Mods
      ■ State registers between each pipeline stage to isolate them
        [Figure: pipelined datapath with stages IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack, separated by the state registers IF/ID, ID/EX, EX/MEM and MEM/WB; components shown: PC, Instruction Memory, Register File, Sign Extend (16 to 32), Shift left 2, two adders, ALU, Data Memory, System Clock]

  19. MIPS Pipeline Control Path Modifications
      ■ All control signals can be determined during Decode
        ● and held in the state registers between pipeline stages
        [Figure: the pipelined datapath with a Control unit added in the decode stage; control signals shown: PCSrc, Branch, RegWrite, MemtoReg, ALUSrc, ALU cntrl, MemRead, ALUOp, RegDst]

  20. Pipeline Control
      ■ IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
      ■ ID Stage: no optional control signals to set

               EX Stage                           MEM Stage                   WB Stage
               RegDst  ALUOp1  ALUOp0  ALUSrc     Brch  MemRead  MemWrite     RegWrite  MemtoReg
        R      1       1       0       0          0     0        0            1         0
        lw     0       0       0       1          0     1        0            1         1
        sw     X       0       0       1          0     0        1            0         X
        beq    X       0       1       0          1     0        0            0         X
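
The table above can be written as a lookup structure grouped by the stage that consumes each signal. The values are taken from the table; `None` stands for the don't-care entries (X). The function `pipeline_register_contents` is a hypothetical helper, and the idea that each pipeline register carries only the signals still needed downstream follows the usual textbook treatment rather than anything stated on this slide.

```python
X = None  # don't care

CONTROL = {
    "R-type": {
        "EX":  {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB":  {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX":  {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB":  {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB":  {"RegWrite": 0, "MemtoReg": X},
    },
    "beq": {
        "EX":  {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB":  {"RegWrite": 0, "MemtoReg": X},
    },
}


def pipeline_register_contents(instr):
    """Control bits held in each pipeline register for one instruction."""
    signals = CONTROL[instr]
    return {
        # Decode produces everything; later registers drop used groups.
        "ID/EX":  {**signals["EX"], **signals["MEM"], **signals["WB"]},
        "EX/MEM": {**signals["MEM"], **signals["WB"]},
        "MEM/WB": dict(signals["WB"]),
    }


for reg, bits in pipeline_register_contents("lw").items():
    print(reg, bits)
```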

  21. Graphically Representing MIPS Pipeline
      [Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
      ■ Can help with answering questions like:
        ● How many cycles does it take to execute this code?
        ● What is the ALU doing during cycle 4?
        ● Is there a hazard, why does it occur, and how can it be fixed?

  22. Other Pipeline Structures Are Possible
      ■ What about the (slow) multiply operation?
        ● make the clock twice as slow or …
        ● let it take two cycles (since it doesn’t use the DM stage)
        [Figure: pipeline diagram with the labels IM, Reg, ALU, MUL, DM, Reg]
      ■ What if the data memory access is twice as slow as the instruction memory?
        ● make the clock twice as slow or …
        ● let data memory access take two cycles (and keep the same clock rate)
        [Figure: pipeline diagram with the stages IM, Reg, ALU, DM1, DM2, Reg]

  23. Other Sample Pipeline Alternatives
      ■ ARM7

        Stage   Activities
        IM      PC update; IM access
        Reg     decode; reg access
        EX      ALU op; DM access; shift/rotate; commit result (write back)

      ■ XScale

        Stage   Activities
        IM1     PC update; BTB access; start IM access
        IM2     IM access
        Reg     decode; reg 1 access
        SHFT    shift/rotate; reg 2 access
        ALU     ALU op
        DM1     start DM access; exception
        DM2     DM write; reg write

  24. Why Pipeline? For Performance!
      [Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order on the vertical axis; Inst 0 through Inst 4 each pass through IM, Reg, ALU, DM, Reg, each starting one cycle after the previous one; the first cycles are labeled "Time to fill the pipeline"]
      ■ Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
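
The diagram on this slide can be regenerated with a short script, which also shows how the fill time affects the average cycles per instruction. This is a sketch assuming no hazards or stalls; `pipeline_chart` and `average_cpi` are hypothetical helper names, and "average CPI" here means total cycles divided by instruction count, including the fill time.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_chart(num_instr):
    """Rows = instructions, columns = clock cycles, cells = stage names."""
    total_cycles = num_instr + len(STAGES) - 1
    chart = []
    for i in range(num_instr):
        row = ["."] * total_cycles
        # Instruction i enters the pipeline in cycle i (0-based).
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage
        chart.append(row)
    return chart


def average_cpi(num_instr):
    """Total cycles divided by instruction count, including fill time."""
    return (num_instr + len(STAGES) - 1) / num_instr


chart = pipeline_chart(5)
for i, row in enumerate(chart):
    print(f"Inst {i}: " + " ".join(f"{cell:>3}" for cell in row))

for n in (5, 100, 1_000_000):
    print(f"{n:>9} instructions: average CPI = {average_cpi(n):.6f}")
```

Reading a column of the chart answers questions such as which instruction occupies the ALU in a given cycle.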

  25. Outline
      ■ Review: Flip-Flop Control Signals
      ■ Pipeline Motivations
      ■ Pipeline Hazards
      ■ Exceptions
