CENG 3420 Lecture 07: Pipeline Bei Yu byu@cse.cuhk.edu.hk CENG3420 L07.1 Spring 2018

Outline
• Review: Flip-Flop Control Signals
• Pipeline Motivations
• Pipeline Hazards
• Exceptions

Clocking Methodologies
• A clocking methodology defines when signals can be read and when they can be written
  (figure: clock waveform marking one clock cycle, the rising (positive) edge, and the falling (negative) edge)
  ● clock rate = 1/(clock cycle)
  ● e.g., a 10 ns clock cycle = 100 MHz clock rate; a 1 ns clock cycle = 1 GHz clock rate
• State element design choices
  ● level-sensitive latch
  ● master-slave and edge-triggered flip-flops
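
The relation clock rate = 1/(clock cycle) can be checked with a few lines of Python. This is only a sketch: the function name `clock_rate_hz` and the constant `NS` are made up for illustration and do not come from the slides.

```python
def clock_rate_hz(cycle_time_s: float) -> float:
    """Clock rate in hertz for a clock cycle time given in seconds."""
    if cycle_time_s <= 0:
        raise ValueError("cycle time must be positive")
    return 1.0 / cycle_time_s


NS = 1e-9  # one nanosecond, in seconds (illustrative constant)

# The two examples from the slide: 10 ns and 1 ns clock cycles.
rate_10ns_mhz = clock_rate_hz(10 * NS) / 1e6
rate_1ns_ghz = clock_rate_hz(1 * NS) / 1e9

print(f"10 ns cycle: {rate_10ns_mhz:.0f} MHz")
print(f" 1 ns cycle: {rate_1ns_ghz:.0f} GHz")
```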

Review: Latches vs Flip-flops
• Output is equal to the stored value inside the element
• Change of state (value) is based on the clock
  ● Latch: output changes whenever the inputs change and the clock is asserted (level-sensitive methodology)
    - two-sided timing constraint
  ● Flip-flop: output changes only on a clock edge (edge-triggered methodology)
    - one-sided timing constraint
A clocking methodology defines when signals can be read and written: we would NOT want to read a signal at the same time it is being written.

Review: Design a Latch
• Store one bit of information: cross-coupled inverters
• How to change the value stored?
  ● R: reset signal
  ● S: set signal
  (figure: SR latch and other latch structures)

Review: Design a Flip-Flop
• Based on the gated latch
• Master-slave positive-edge-triggered D flip-flop

Review: Latch and Flip-Flop
• A latch is level-sensitive
• A flip-flop is edge-triggered
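
The difference between level-sensitive and edge-triggered behavior is easier to see on a waveform. Below is a small behavioral sketch, not a gate-level model. The class names `DLatch` and `DFlipFlop` and the sample waveforms are invented for illustration, and a D-type latch is assumed rather than the SR latch shown earlier.

```python
class DLatch:
    """Level-sensitive: q follows d while clk is high and holds while clk is low."""

    def __init__(self):
        self.q = 0

    def step(self, clk: int, d: int) -> int:
        if clk:
            self.q = d
        return self.q


class DFlipFlop:
    """Positive-edge-triggered: q samples d only on a 0 -> 1 clock transition."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def step(self, clk: int, d: int) -> int:
        if clk and not self._prev_clk:  # rising edge detected
            self.q = d
        self._prev_clk = clk
        return self.q


# Illustrative waveforms, one sample per time step.
clk = [0, 1, 1, 0, 0, 1, 1, 0]
d = [1, 1, 0, 0, 1, 0, 1, 1]

latch, ff = DLatch(), DFlipFlop()
latch_q = [latch.step(c, x) for c, x in zip(clk, d)]
ff_q = [ff.step(c, x) for c, x in zip(clk, d)]

print("clk    ", clk)
print("d      ", d)
print("latch q", latch_q)
print("ff q   ", ff_q)
```

At the third time step the clock is still high and d drops to 0: the latch output follows it, while the flip-flop keeps the value it captured at the rising edge.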

Our Implementation
• An edge-triggered methodology
• Typical execution
  ● read contents of some state elements
  ● send values through some combinational logic
  ● write results to one or more state elements
  (figure: state element 1 -> combinational logic -> state element 2, all within one clock cycle)
• Assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed
  ● a write occurs only when both the write control is asserted and the clock edge occurs
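
The read, compute, write pattern and the role of the write control signal can be sketched as follows. `Register`, `clock_edge`, and `run_cycle` are hypothetical names chosen for this example; the model is behavioral and ignores setup and hold times.

```python
class Register:
    """Edge-triggered state element with an explicit write-enable input."""

    def __init__(self, value=0):
        self.value = value

    def clock_edge(self, data_in, write_enable=True):
        # State changes only at the clock edge, and only if the
        # write control signal is asserted at that edge.
        if write_enable:
            self.value = data_in


def run_cycle(src, dst, logic, write_enable=True):
    """One clock cycle: read src, apply combinational logic, write dst at the edge."""
    operand = src.value                    # read a state element
    result = logic(operand)                # combinational logic
    dst.clock_edge(result, write_enable)   # write at the clock edge


a, b = Register(5), Register(0)
run_cycle(a, b, lambda x: x + 1)                       # write asserted: b is updated
after_first = b.value
run_cycle(a, b, lambda x: x * 10, write_enable=False)  # write not asserted: b holds
after_second = b.value
print(after_first, after_second)
```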

Review: Instruction Critical Paths
• Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC access, shift left 2, wires) except:
  ● Instruction and Data Memory (4 ns)
  ● ALU and adders (2 ns)
  ● Register File access (reads or writes) (1 ns)

Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total (ns)
R-type     4       1        2                1          8
load       4       1        2        4       1         12
store      4       1        2        4                 11
beq        4       1        2                           7
jump       4                                            4
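
The totals in the table, and the resulting single-cycle clock period, can be recomputed from the three delay figures. The dictionary names below are illustrative; the delays and the units each instruction uses are taken from the table.

```python
# Unit delays in nanoseconds, as given on the slide.
DELAY_NS = {"I Mem": 4, "Reg Rd": 1, "ALU Op": 2, "D Mem": 4, "Reg Wr": 1}

# Functional units on each instruction's critical path (from the table).
USES = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}

totals = {instr: sum(DELAY_NS[u] for u in units) for instr, units in USES.items()}

# In a single-cycle design the clock must fit the slowest instruction.
cycle_time_ns = max(totals.values())
slowest = max(totals, key=totals.get)

for instr, total in totals.items():
    print(f"{instr:<7} {total:>2} ns")
print(f"single-cycle clock period: {cycle_time_ns} ns (set by {slowest})")
```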

Review: Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  ● especially problematic for more complex instructions like floating-point multiply
  (figure: clock waveform over Cycle 1 and Cycle 2; lw occupies Cycle 1, sw occupies part of Cycle 2, and the remainder is marked "Waste")
• May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
but
• It is simple and easy to understand

How Can We Make It Faster?
• Start fetching and executing the next instruction before the current one has completed
  ● Pipelining: (all?) modern processors are pipelined for performance
  ● Remember the performance equation: CPU time = CPI * CC * IC (cycles per instruction * clock cycle time * instruction count)
• Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  ● A five-stage pipeline is nearly five times faster because the CC is "nearly" five times shorter
• Fetch (and execute) more than one instruction at a time
  ● Superscalar processing: stay tuned
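
A rough way to see why the speedup approaches the number of stages is the sketch below. It assumes an idealized pipeline that the slide does not spell out: k perfectly balanced stages, no hazards or stalls, and a pipelined clock cycle of exactly 1/k of the unpipelined instruction time. Under those assumptions n instructions take k + n - 1 cycles, so the speedup is n*k / (k + n - 1). The function name `ideal_speedup` is illustrative.

```python
def ideal_speedup(num_instructions: int, num_stages: int) -> float:
    """Speedup of an ideal k-stage pipeline over an unpipelined design.

    Assumes balanced stages and no stalls (an idealization):
      unpipelined time = n * T
      pipelined time   = (k + n - 1) * (T / k)
    """
    n, k = num_instructions, num_stages
    if n < 1 or k < 1:
        raise ValueError("need at least one instruction and one stage")
    return n * k / (k + n - 1)


for n in (1, 5, 100, 1_000_000):
    print(f"n = {n:>9}: speedup = {ideal_speedup(n, 5):.4f}")
```

With a single instruction there is no gain at all, and the ratio climbs toward 5 only as the instruction count grows, which is the "large number of instructions" condition above.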

The Five Stages of a Load Instruction
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
lw      IFetch    Dec       Exec      Mem       WB
• IFetch: Instruction Fetch and Update PC
• Dec: Register Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the result data into the register file

A Pipelined MIPS Processor
• Start the next instruction before the current one has completed
  ● improves throughput: the total amount of work done in a given time
  ● instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB

  - the clock cycle (pipeline stage time) is limited by the slowest stage
  - some stages don't need the whole clock cycle (e.g., WB)
  - for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

Single Cycle versus Pipeline
Single Cycle Implementation (CC = 800 ps):
  (figure: clock waveform over Cycle 1 and Cycle 2 with lw, sw, and a span marked "Waste"; the figure also carries a "400 ps" label)
Pipeline Implementation (CC = 200 ps):
  lw       IFetch  Dec     Exec    Mem     WB
  sw               IFetch  Dec     Exec    Mem     WB
  R-type                   IFetch  Dec     Exec    Mem     WB
• To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
• How long does each take to complete 1,000,000 adds?
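
One way to work through the two questions, assuming a five-stage pipeline that never stalls (the slide does not state the hazard behavior, so this is an idealization). All times are kept as integer picoseconds, and the function names are illustrative.

```python
SINGLE_CYCLE_CC_PS = 800  # from the slide
PIPELINE_CC_PS = 200      # from the slide
NUM_STAGES = 5            # IFetch, Dec, Exec, Mem, WB


def single_cycle_time_ps(n: int) -> int:
    """Each instruction takes exactly one 800 ps clock cycle."""
    return n * SINGLE_CYCLE_CC_PS


def pipelined_time_ps(n: int) -> int:
    """The first instruction needs NUM_STAGES cycles; each later one
    completes one cycle after its predecessor (no stalls assumed)."""
    return (NUM_STAGES + n - 1) * PIPELINE_CC_PS


n = 1_000_000
single = single_cycle_time_ps(n)
piped = pipelined_time_ps(n)

print(f"latency of one pipelined instruction: {pipelined_time_ps(1)} ps")
print(f"single cycle, {n} adds: {single} ps")
print(f"pipelined,    {n} adds: {piped} ps")
print(f"ratio: {single / piped:.5f}")
```

The 1000 ps latency falls out of the first line: every instruction passes through all five stages, and each stage occupies a full 200 ps cycle even when it needs less. The throughput gain here is close to 4 rather than 5, since 800 ps / 200 ps = 4.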

Pipelining the MIPS ISA
• What makes it easy
  ● all instructions are the same length (32 bits)
    - can fetch in the 1st stage and decode in the 2nd stage
  ● few instruction formats (three) with symmetry across formats
    - can begin reading the register file in the 2nd stage
  ● memory operations occur only in loads and stores
    - can use the execute stage to calculate memory addresses
  ● each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
  ● operands must be aligned in memory so a single data transfer takes only one data memory access

MIPS Pipeline Datapath Additions/Mods
• State registers between each pipeline stage to isolate them
  (figure: five-stage datapath. Stages: IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess), WB (WriteBack). Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Components shown: PC, Instruction Memory, Register File, Sign Extend (16 to 32 bits), Shift left 2, two adders, the constant 4, ALU, Data Memory, System Clock)

MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
  ● and held in the state registers between pipeline stages
  (figure: the pipelined datapath with a Control unit added in the Decode stage. Signals shown: PCSrc, Branch, RegWrite, MemtoReg, ALUSrc, ALUOp, RegDst, MemRead, plus an ALU cntrl block)

Pipeline Control
• IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
• ID Stage: no optional control signals to set

        |            EX Stage            |        MEM Stage        |      WB Stage
Instr   | RegDst  ALUOp1  ALUOp0  ALUSrc | Brch  MemRead  MemWrite | RegWrite  MemtoReg
R       |   1       1       0       0    |  0      0        0      |    1         0
lw      |   0       0       0       1    |  0      1        0      |    1         1
sw      |   X       0       0       1    |  0      0        1      |    0         X
beq     |   X       0       1       0    |  1      0        0      |    0         X
(X = don't care)
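
The control table can be written down as data and split by the stage that consumes each signal. The sketch below rests on one assumption not stated on the slide: each pipeline register carries only the signals still needed by later stages (all nine in ID/EX, the MEM and WB groups in EX/MEM, the WB group in MEM/WB). The names `CONTROL`, `decode`, and `signals_in_register` are illustrative.

```python
EX_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
MEM_SIGNALS = ("Brch", "MemRead", "MemWrite")
WB_SIGNALS = ("RegWrite", "MemtoReg")
ALL_SIGNALS = EX_SIGNALS + MEM_SIGNALS + WB_SIGNALS

# One character per signal, in ALL_SIGNALS order; "X" means don't care.
CONTROL = {
    "R":   "110000010",
    "lw":  "000101011",
    "sw":  "X0010010X",
    "beq": "X0101000X",
}

# Assumption: signals already consumed by a stage are not passed further down.
CARRIED = {
    "ID/EX": ALL_SIGNALS,
    "EX/MEM": MEM_SIGNALS + WB_SIGNALS,
    "MEM/WB": WB_SIGNALS,
}


def decode(instr_class: str) -> dict:
    """Control signals determined in the Decode stage for one instruction class."""
    return dict(zip(ALL_SIGNALS, CONTROL[instr_class]))


def signals_in_register(instr_class: str, pipeline_reg: str) -> dict:
    """Signals still held in the given pipeline register for this instruction."""
    decoded = decode(instr_class)
    return {name: decoded[name] for name in CARRIED[pipeline_reg]}


for reg in ("ID/EX", "EX/MEM", "MEM/WB"):
    print(reg, signals_in_register("lw", reg))
```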

Graphically Representing MIPS Pipeline
  (figure: one instruction drawn as the sequence of resources it uses: IM, Reg, ALU, DM, Reg)
• Can help with answering questions like:
  ● How many cycles does it take to execute this code?
  ● What is the ALU doing during cycle 4?
  ● Is there a hazard, why does it occur, and how can it be fixed?
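
The first two questions can be answered mechanically for a hazard-free instruction sequence. The sketch below assumes one instruction issues per cycle with no stalls; the helper names and the three-instruction program are invented for illustration.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # resources used, in order


def total_cycles(num_instructions: int) -> int:
    """Cycles to run the program when one instruction issues per cycle."""
    return len(STAGES) + num_instructions - 1


def instruction_in_stage(program, stage_index: int, cycle: int):
    """Instruction occupying a stage in a given cycle (cycles are 1-based).

    Instruction i (0-based) enters stage 0 in cycle i + 1, so in cycle c
    stage s holds instruction c - 1 - s, if such an instruction exists.
    """
    i = cycle - 1 - stage_index
    return program[i] if 0 <= i < len(program) else None


def diagram(program) -> str:
    """Staggered text diagram, one row per instruction, one column per cycle."""
    rows = []
    for i, instr in enumerate(program):
        cells = ["   "] * i + [f"{s:<3}" for s in STAGES]
        rows.append(f"{instr:<6} " + " ".join(cells))
    return "\n".join(rows)


program = ["lw", "sw", "add"]  # illustrative instruction sequence
ALU = STAGES.index("ALU")

print(diagram(program))
print("total cycles:", total_cycles(len(program)))
print("ALU in cycle 4 is working on:", instruction_in_stage(program, ALU, 4))
```

The third question, about hazards, needs register and memory dependences that this simple occupancy model does not track.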

Other Pipeline Structures Are Possible
• What about the (slow) multiply operation?
  ● make the clock twice as slow or …
  ● let it take two cycles (since it doesn't use the DM stage)
  (figure: IM, Reg, ALU, DM, Reg pipeline with a MUL unit)
• What if the data memory access is twice as slow as the instruction memory?
  ● make the clock twice as slow or …
  ● let data memory access take two cycles (and keep the same clock rate)
  (figure: IM, Reg, ALU, DM1, DM2, Reg pipeline)

Other Sample Pipeline Alternatives
• ARM7
  ● IM: PC update, IM access
  ● Reg: decode, reg access
  ● EX: ALU op, DM access, shift/rotate, commit result (write back)
• XScale
  ● stage labels in the figure: IM1, IM2, Reg, SHFT, ALU, DM1, DM2, Reg
  ● operation labels in the figure: PC update, BTB access, start IM access, IM access, decode, reg 1 access, reg 2 access, shift/rotate, ALU op, start DM access, exception, DM write, reg write

Why Pipeline? For Performance!
  (figure: Inst 0 through Inst 4 listed in instruction order, each using IM, Reg, ALU, DM, Reg and starting one clock cycle after the previous one; horizontal axis: time (clock cycles); the initial cycles are marked "Time to fill the pipeline")
• Once the pipeline is full, one instruction is completed every cycle, so CPI = 1