Spring 2016 :: CSE 502 – Computer Architecture Processor Pipeline Nima Honarmand
Generic Instruction Cycle
• Steps in processing an instruction:
  – Instruction Fetch (IF_STEP)
  – Instruction Decode (ID_STEP)
  – Operand Fetch (OF_STEP)
    • Might be from registers or memory
  – Execute (EX_STEP)
    • Perform computation on the operands
  – Result Store or Write Back (RS_STEP)
    • Write the execution results back to registers or memory
• ISA determines what needs to be done in each step for each instruction
• μArch determines how HW implements the steps
Datapath vs. Control Logic
• Datapath is the collection of HW components and their connection in a processor
  – Determines the static structure of processor
• Control logic determines the dynamic flow of data between the components
  – E.g., the control lines of MUXes and ALU in last slide
  – Is a function of?
    • Instruction words
    • State of the processor
    • Execution results at each stage
Generic Datapath Components
• Main components
  – Instruction Cache
  – Data Cache
  – Register File
  – Functional Units (ALU, Floating Point Unit, Memory Unit, …)
  – Pipeline Registers
• Auxiliary Components (in advanced processors)
  – Reservation Stations
  – Reorder Buffer
  – Branch Predictor
  – Prefetchers
  – …
• Lots of glue logic (often multiplexors) to glue these together
Example: MIPS Instruction Set
• All instructions are 32 bits
A Simple MIPS Datapath
[Figure: datapath built from PC, +1 incrementer, I-cache, Register File, ALU, and D-cache, divided into five stages: Inst. Fetch (IF), Inst. Decode & Register Read (ID), Execute (EX), Memory (MEM), and Write-Back (WB). The stages are annotated with the generic steps IF_STEP, ID_STEP, OF_STEP, EX_STEP, and RS_STEP.]
Single-Instruction Datapath
[Figure: timelines, time running left to right]
  Single-cycle: ins0.(fetch,dec,ex,mem,wb) | ins1.(fetch,dec,ex,mem,wb)
  Multi-cycle:  ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb) | ins1.fetch | ins1.(dec,ex) | ins1.(mem,wb)
• Process one instruction at a time
• Single-cycle control: hardwired
  – Low CPI (1)
  – Long clock period (to accommodate slowest instruction)
• Multi-cycle control: typically micro-programmed
  – Short clock period
  – High CPI
• Can we have both low CPI and short clock period?
  – Not if datapath executes only one instruction at a time
  – No good way to make a single instruction go faster
Pipelined Datapath
[Figure: timelines, time running left to right]
  Multi-cycle: ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb) | ins1.fetch | ins1.(dec,ex) | ins1.(mem,wb)
  Pipelined:   ins0.fetch | ins0.(dec,ex) | ins0.(mem,wb)
                            ins1.fetch    | ins1.(dec,ex) | ins1.(mem,wb)
                                            ins2.fetch    | ins2.(dec,ex) | ins2.(mem,wb)
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1
• Each instruction passes through all stages … but instructions enter and leave at faster rate

  Style          Ideal CPI   Cycle Time (1/freq)
  Single-cycle   1           Long
  Multi-cycle    > 1         Short
  Pipelined      1           Short

• Pipeline can have as many insns in flight as there are stages
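The Style / Ideal CPI / Cycle Time comparison above can be made concrete with a small calculation. The following is a minimal sketch under idealized assumptions that are not stated on the slide: every stage has the same latency, every instruction uses every stage, there are no hazards, and pipeline registers add no delay. The function name `total_time` and its parameters are hypothetical.

```python
def total_time(style, n_insns, n_stages, stage_time):
    """Return (cycles, total time) for n_insns under an idealized model."""
    if style == "single-cycle":
        cycles = n_insns                    # CPI = 1
        cycle_time = n_stages * stage_time  # clock must fit the whole instruction
    elif style == "multi-cycle":
        cycles = n_insns * n_stages         # CPI = number of stages
        cycle_time = stage_time             # clock fits one stage
    elif style == "pipelined":
        cycles = n_stages + n_insns - 1     # fill the pipe once, then 1 insn/cycle
        cycle_time = stage_time
    else:
        raise ValueError(f"unknown style: {style}")
    return cycles, cycles * cycle_time


for style in ("single-cycle", "multi-cycle", "pipelined"):
    print(style, total_time(style, n_insns=100, n_stages=3, stage_time=1))
```

With 100 instructions and 3 unit-latency stages, the single-cycle and multi-cycle designs both take 300 time units in this model, while the pipelined design takes 102.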
Pipeline Illustrated
[Figure: a block of combinational logic followed by a latch (L), shown unpipelined and then split into 2 and 3 latched stages]
• Unpipelined: combinational logic with n gate delays, BW = ~(1/n)
• 2 stages of n/2 gate delays each, BW = ~(2/n)
• 3 stages of n/3 gate delays each, BW = ~(3/n)
• Pipeline latency = n gate delays + (p-1) register delays, where p = # of stages
• Improves throughput at the expense of latency
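The latency and bandwidth expressions above can be evaluated directly. This is a minimal sketch; the function name `pipeline_metrics` and the unit choice (everything measured in gate delays) are assumptions made for illustration, and the bandwidth figure keeps the slide's approximation of ignoring register delay.

```python
def pipeline_metrics(n_gate_delays, p_stages, reg_delay):
    """Latency and approximate bandwidth of logic split into p_stages stages."""
    # Latency: all n gate delays plus the (p - 1) added pipeline registers.
    latency = n_gate_delays + (p_stages - 1) * reg_delay
    # Bandwidth: roughly p results per n gate delays (register delay ignored).
    bandwidth = p_stages / n_gate_delays
    return latency, bandwidth


for p in (1, 2, 3):
    print(p, pipeline_metrics(n_gate_delays=12, p_stages=p, reg_delay=1))
```

For n = 12 and a register delay of 1, going from 1 to 3 stages raises bandwidth from 1/12 to 1/4 results per gate delay while latency grows from 12 to 14 gate delays.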
5-Stage MIPS Pipeline
Stage 1: Fetch
• Fetch an instruction from instruction cache every cycle
  – Use PC to index instruction cache
  – Increment PC (assume no branches for now)
• Write state to the pipeline register (IF/ID)
  – The next stage will read this pipeline register
Stage 1: Fetch Diagram
[Figure: components shown are the PC (with enable), a +1 adder, a MUX with a target input, the Instruction Cache, and the IF/ID pipeline register (with enable), which holds PC+1 and the instruction bits and feeds Decode.]
Stage 2: Decode
• Decodes opcode bits
  – Set up Control signals for later stages
• Read input operands from register file
  – Specified by decoded instruction bits
• Write state to the pipeline register (ID/EX)
  – Opcode
  – Register contents, immediate operand
  – PC+1 (even though decode didn’t use it)
  – Control signals (from insn) for opcode and destReg
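Because every MIPS instruction is 32 bits with fields at fixed positions, the "decoded instruction bits" step amounts to shifts and masks. The sketch below uses the standard MIPS32 field layout (opcode in bits 31..26, rs 25..21, rt 20..16, rd 15..11, shamt 10..6, funct 5..0, 16-bit immediate 15..0). The function name `decode_fields` is hypothetical, and how these fields map onto the slide's regA, regB, and destReg labels is left open here.

```python
def decode_fields(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21
        "rt": (word >> 16) & 0x1F,      # bits 20..16
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (R-type destination)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6
        "funct": word & 0x3F,           # bits 5..0 (R-type operation)
        "imm": word & 0xFFFF,           # bits 15..0 (I-type immediate)
    }


# add $3, $1, $2 encodes as 0x00221820 (opcode 0, funct 0x20).
fields = decode_fields(0x00221820)
print(fields["opcode"], fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

All fields are extracted unconditionally; in hardware the control logic then uses the opcode to decide which of them are meaningful for the instruction at hand.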
Stage 2: Decode Diagram
[Figure: the IF/ID pipeline register (instruction bits, PC+1) feeds the Register File, which is indexed by regA and regB and has destReg, data, and enable write inputs. The ID/EX pipeline register receives PC+1, contents of regA, contents of regB, and control signals/imm, and feeds Execute.]
Stage 3: Execute
• Perform ALU operations
  – Calculate result of instruction
    • Control signals select operation
    • Contents of regA used as one input
    • Either regB or constant offset (imm from insn) used as second input
  – Calculate PC-relative branch target
    • PC+1+(constant offset)
• Write state to the pipeline register (EX/Mem)
  – ALU result, contents of regB, and PC+1+offset
  – Control signals (from insn) for opcode and destReg
Stage 3: Execute Diagram
[Figure: the ID/EX pipeline register feeds an adder that computes the target (PC+1 + offset) and the ALU, whose inputs are the contents of regA and the output of a MUX fed by the contents of regB and imm. The EX/Mem pipeline register receives the target, the ALU result, the contents of regB, and the control signals, and feeds Memory.]
Stage 4: Memory
• Perform data cache access
  – ALU result contains address for LD or ST
  – Opcode bits control R/W and enable signals
• Write state to the pipeline register (Mem/WB)
  – ALU result and Loaded data
  – Control signals (from insn) for opcode and destReg
Stage 4: Memory Diagram
[Figure: the EX/Mem pipeline register feeds the Data Cache: the ALU result drives in_addr, the contents of regB drive in_data, and the control signals drive en and R/W. The Mem/WB pipeline register receives the ALU result, the loaded data, and the control signals, and feeds Write-back.]
Stage 5: Write-back
• Writing result to register file (if required)
  – Write Loaded data to destReg for LD
  – Write ALU result to destReg for ALU insn
  – Opcode bits control register write enable signal
Stage 5: Write-back Diagram
[Figure: from the Mem/WB pipeline register, a MUX fed by the ALU result and the loaded data produces the register write data; a second MUX on the destReg path is driven by the control signals.]
Putting It All Together
[Figure: the full five-stage datapath. PC, +1 adder, Inst Cache, Register File, target adder, eq? comparator, ALU, Data Cache, and MUXes are connected through the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, with control signals carried forward from stage to stage.]
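The five stages and four pipeline registers described above can be sketched as a tiny cycle-by-cycle simulator. This is a simplified illustration and not the course's implementation: the `Insn` record, the three-opcode instruction set (add, lw, sw), and the function `run_pipeline` are all hypothetical. It assumes no branches, no hazard detection or forwarding (so the program must not read a register that an in-flight instruction has yet to write), and that a register written in write-back is visible to decode in the same cycle.

```python
from collections import namedtuple

# Hypothetical instruction record; field names follow the slide labels.
Insn = namedtuple("Insn", "op reg_a reg_b dest imm")


def run_pipeline(program, regs, mem):
    """Simulate the five stages cycle by cycle; return the cycle count."""
    pc = 0
    if_id = id_ex = ex_mem = mem_wb = None  # None means the register holds a bubble
    cycles = 0
    while pc < len(program) or any(
        r is not None for r in (if_id, id_ex, ex_mem, mem_wb)
    ):
        cycles += 1
        # Stages run last-to-first so that each pipeline register is
        # overwritten only after the stage that consumes it has read it.

        # Stage 5: write-back (register write enabled only for add and lw).
        if mem_wb is not None:
            insn, alu_result, loaded = mem_wb
            if insn.op == "add":
                regs[insn.dest] = alu_result
            elif insn.op == "lw":
                regs[insn.dest] = loaded

        # Stage 4: memory. The ALU result is the address for lw and sw.
        mem_wb = None
        if ex_mem is not None:
            insn, alu_result, val_b = ex_mem
            loaded = None
            if insn.op == "lw":
                loaded = mem[alu_result]
            elif insn.op == "sw":
                mem[alu_result] = val_b
            mem_wb = (insn, alu_result, loaded)

        # Stage 3: execute. Second ALU input is regB (add) or the immediate.
        ex_mem = None
        if id_ex is not None:
            insn, val_a, val_b = id_ex
            second = val_b if insn.op == "add" else insn.imm
            ex_mem = (insn, val_a + second, val_b)

        # Stage 2: decode and register read.
        id_ex = None
        if if_id is not None:
            insn = if_id
            id_ex = (insn, regs[insn.reg_a], regs[insn.reg_b])

        # Stage 1: fetch and increment the PC (no branches in this sketch).
        if_id = None
        if pc < len(program):
            if_id = program[pc]
            pc += 1
    return cycles


regs = [0, 10, 20, 0, 0]
mem = [0] * 32
mem[18] = 99
program = [
    Insn("add", 1, 2, 3, 0),  # r3 = r1 + r2
    Insn("lw", 1, 0, 4, 8),   # r4 = mem[r1 + 8]
    Insn("sw", 1, 2, 0, 9),   # mem[r1 + 9] = r2
]
print(run_pipeline(program, regs, mem), regs, mem[19])
```

The three independent instructions finish in 3 + 5 - 1 = 7 cycles. Making the second instruction read r3 would expose the missing dependency handling: it would see the old value of r3.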
Issues With Pipelining
Pipelining Idealism
• Uniform Sub-operations
  – Operation can be partitioned into uniform-latency sub-ops
• Repetition of Identical Operations
  – Same ops performed on many different inputs
• Independent Operations
  – All ops are mutually independent
Pipeline Realism
• Uniform Sub-operations … NOT!
  – Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (left-over time near end of cycle)
• Repetition of Identical Operations … NOT!
  – Unifying instruction types
    • Coalescing instruction types into one “multi-function” pipe
    • Minimize external fragmentation (idle stages to match length)
• Independent Operations … NOT!
  – Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
• Pipelining is expensive
The Generic Instruction Pipeline
• IF: Instruction Fetch
• ID: Instruction Decode
• OF: Operand Fetch
• EX: Instruction Execute
• WB: Write-back
Balancing Pipeline Stages
• Stage latencies:
  – IF: T_IF = 6 units
  – ID: T_ID = 2 units
  – OF: T_OF = 9 units
  – EX: T_EX = 5 units
  – WB: T_OS = 9 units
• Without pipelining: T_cyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31
• Pipelined: T_cyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
• Speedup = 31 / 9 = 3.44
• Can we do better?
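The speedup figure above follows from two lines of arithmetic over the five stage latencies. This is a minimal sketch; the dictionary `stage_latency` and the function name `pipelining_speedup` are hypothetical, and pipeline register overhead is ignored, as on the slide.

```python
def pipelining_speedup(stage_latency):
    """Return (unpipelined cycle, pipelined cycle, speedup) for the given stages."""
    unpipelined = sum(stage_latency.values())  # one long cycle covers every step
    pipelined = max(stage_latency.values())    # the slowest stage sets the clock
    return unpipelined, pipelined, unpipelined / pipelined


stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}
unpipelined, pipelined, speedup = pipelining_speedup(stage_latency)
print(unpipelined, pipelined, round(speedup, 2))
```

With five perfectly balanced stages the speedup would approach 5. The shortfall here comes from the short stages (ID at 2 units, EX at 5) idling for the rest of each 9-unit cycle.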
Balancing Pipeline Stages (1/2)
• Two methods for stage quantization
  – Divide sub-ops into smaller pieces
  – Merge multiple sub-ops into one
• Recent/Current trends
  – Deeper pipelines (more and more stages)
  – Pipelining of memory accesses
  – Multiple different pipelines/sub-pipelines
Balancing Pipeline Stages (2/2)
• Coarser-grained machine cycle: 4 machine cyc / instruction
  – IF and ID merged: T_IF&ID = 8 units
  – OF: T_OF = 9 units
  – EX: T_EX = 5 units
  – WB: T_OS = 9 units
  – # stages = 4, T_cyc = 9 units
• Finer-grained machine cycle: 11 machine cyc / instruction
  – IF ×2, ID ×1, OF ×3, EX ×2, WB ×3
  – # stages = 11, T_cyc = 3 units
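Both quantizations above can be reproduced by rounding each sub-operation up to a whole number of machine cycles. This is a minimal sketch; the function name `quantize` and the way internal fragmentation is measured (total slot time minus total work) are assumptions made for illustration.

```python
import math


def quantize(stage_latency, t_cyc):
    """Return (stages per sub-op, total stages, wasted time units)."""
    # Each sub-op occupies as many whole cycles as it needs.
    stages = {name: math.ceil(t / t_cyc) for name, t in stage_latency.items()}
    total = sum(stages.values())
    # Internal fragmentation: time slots reserved minus time actually used.
    wasted = total * t_cyc - sum(stage_latency.values())
    return stages, total, wasted


# Coarser-grained: IF and ID merged into one 8-unit sub-op, 9-unit cycle.
print(quantize({"IF&ID": 8, "OF": 9, "EX": 5, "OS": 9}, t_cyc=9))
# Finer-grained: the original five sub-ops split over a 3-unit cycle.
print(quantize({"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}, t_cyc=3))
```

The coarser design has 4 stages and leaves 5 units idle per instruction. The finer design has 11 stages, leaves 2 units idle, and can start a new instruction every 3 units instead of every 9.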