PIPELINING: 5-STAGE PIPELINE Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture
Overview ¨ Announcement ¤ Tonight: Homework 1 deadline (11:59PM) n Verify your uploaded files before deadline ¨ This lecture ¤ Impacts of pipelining on performance ¤ The MIPS five-stage pipeline ¤ Pipeline hazards n Structural hazards n Data hazards
Single-cycle RISC Architecture ¨ Example: simple MIPS architecture ¤ Critical path includes all of the processing steps Write Back Controller PC Inst. Register Data ALU Memory File Memory Inst. Fetch Inst. Decode Execute Memory
Single-cycle RISC Architecture ¨ Example program ¤ CT=6ns; CPU Time = ? AND R1,R2,R3 XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Single-cycle RISC Architecture ¨ Example program ¤ CT=6ns; CPU Time = 5 x 1 x 6ns = 30ns AND R1,R2,R3 How to improve? XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Reusing Idle Resources ¨ Each processing step finishes in a fraction of a cycle ¤ Idle resources can be reused for processing next instructions Write Back PC Inst. Register Data ALU Memory File Memory Inst. Fetch Inst. Decode Execute Memory
Pipelined Architecture ¨ Five stage pipeline ¤ Critical path determines the cycle time 0.7ns Write Back PC Inst. Register Data ALU Memory File Memory Inst. Fetch Inst. Decode Execute Memory 1.5ns 1.05ns 1.25ns 1.5ns
Pipelined Architecture ¨ Example program ¤ CT=1.5ns; CPU Time = ? AND R1,R2,R3 XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Pipelined Architecture ¨ Example program ¤ CT=1.5ns; CPU Time = 5 x 5 x 1.5ns = 37.5ns > 30ns WORSE!! AND R1,R2,R3 XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Pipelined Architecture ¨ Example program ¤ CT=1.5ns; CPU Time = ? AND R1,R2,R3 XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Pipelined Architecture ¨ Example program ¤ CT=1.5ns; CPU Time = 9 x 1 x 1.5ns = 13.5ns What is the cost of pipelining? AND R1,R2,R3 XOR R4,R2,R3 SUB R5,R1,R4 ADD R6,R1,R4 MUL R7,R5,R6 Time CPU Time = IC x CPI x CT
Pipelining Technique ¨ Improving throughput at the expense of latency ¤ Delay: D = T + n δ ¤ Throughput: IPS = n/(T + n δ ) Combinational Logic Critical Path Delay = 30
Pipelining Technique ¨ Improving throughput at the expense of latency ¤ Delay: D = T + n δ ¤ Throughput: IPS = n/(T + n δ ) Combinational Logic D = Critical Path Delay = 30 IPS = Combinational Logic Combinational Logic D = IPS = Critical Path Delay = 15 Critical Path Delay = 15 Comb. Logic Comb. Logic Comb. Logic D = IPS = Delay = 10 Delay = 10 Delay = 10
Pipelining Technique ¨ Improving throughput at the expense of latency ¤ Delay: D = T + n δ ¤ Throughput: IPS = n/(T + n δ ) Combinational Logic D = 31 Critical Path Delay = 30 IPS = 1/31 Combinational Logic Combinational Logic D = 32 IPS = 2/32 Critical Path Delay = 15 Critical Path Delay = 15 Comb. Logic Comb. Logic Comb. Logic D = 33 IPS = 3/33 Delay = 10 Delay = 10 Delay = 10
Pipelining Latency vs. Throughput ¨ Theoretical delay and throughput models for perfect pipelining Delay (D) 20 Relative Performance 15 10 5 0 0 50 100 150 200 Number of Pipeline Stages
Pipelining Latency vs. Throughput ¨ Theoretical delay and throughput models for perfect pipelining Delay (D) Throughput (IPS) 20 Relative Performance 15 10 5 0 0 50 100 150 200 Number of Pipeline Stages
Five Stage MIPS Pipeline
Simple Five Stage Pipeline ¨ A pipelined load-store architecture that processes up to one instruction per cycle Write Back PC Inst. Register Data ALU Memory File Memory Inst. Fetch Inst. Decode Execute Memory
Instruction Fetch ¨ Read an instruction from memory (I-Memory) ¤ Use the program counter (PC) to index into the I- Memory ¤ Compute NPC by incrementing current PC n What about branches? ¨ Update pipeline registers ¤ Write the instruction into the pipeline registers
Instruction Fetch clock Branch Target NPC = PC + 4 NPC clock PC + Why increment 4 by 4? Instruction Memory Pipeline Register
Instruction Fetch clock P3 Branch Target NPC = PC + 4 NPC clock PC + P2 Why increment 4 by 4? Instruction P1 Memory Critical Path = Max{P1, P2, P3} Pipeline Register
Instruction Decode ¨ Generate control signals for the opcode bits ¨ Read source operands from the register file (RF) ¤ Use the specifiers for indexing RF n How many read ports are required? ¨ Update pipeline registers ¤ Send the operand and immediate values to next stage ¤ Pass control signals and NPC to next stage
Instruction Decode target NPC NPC reg Register Instruction File reg ctrl decode Pipeline Pipeline Register Register
Execute Stage ¨ Perform ALU operation ¤ Compute the result of ALU n Operation type: control signals n First operand: contents of a register n Second operand: either a register or the immediate value ¤ Compute branch target n Target = NPC + immediate ¨ Update pipeline registers ¤ Control signals, branch target, ALU results, and destination
Execute Stage Target NPC + Res reg ALU reg reg ctrl ctrl Pipeline Pipeline Register Register
Memory Access ¨ Access data memory ¤ Load/store address: ALU outcome ¤ Control signals determine read or write access ¨ Update pipeline registers ¤ ALU results from execute ¤ Loaded data from D-Memory ¤ Destination register
Memory Access Target Res Res addr Dat reg Memory data data ctrl ctrl Pipeline Pipeline Register Register
Register Write Back ¨ Update register file ¤ Control signals determine if a register write is needed ¤ Only one write port is required n Write the ALU result to the destination register, or n Write the loaded data into the register file
Five Stage Pipeline ¨ Ideal pipeline: IPC=1 ¤ Is there enough resources to keep the pipeline stages busy all the time? Inst. Fetch Decode Execute Memory Writeback + + PC ALU Reg. Reg. 4 File File Mem Mem
Pipeline Hazards
Pipeline Hazards ¨ Structural hazards: multiple instructions compete for the same resource ¨ Data hazards: a dependent instruction cannot proceed because it needs a value that hasn’t been produced ¨ Control hazards: the next instruction cannot be fetched because the outcome of an earlier branch is unknown
Structural Hazards ¨ 1. Unified memory for instruction and data R1 ß Mem[R2] R3 ß Mem[R20] R6 ß R4-R5 R7 ß R1+R0
Structural Hazards ¨ 1. Unified memory for instruction and data R1 ß Mem[R2] R3 ß Mem[R20] R6 ß R4-R5 R7 ß R1+R0 Separate inst. and data memories.
Structural Hazards ¨ 1. Unified memory for instruction and data ¨ 2. Register file with shared read/write access ports R1 ß Mem[R2] R3 ß Mem[R20] R6 ß R4-R5 R7 ß R1+R0
Structural Hazards ¨ 1. Unified memory for instruction and data ¨ 2. Register file with shared read/write access ports R1 ß Mem[R2] R3 ß Mem[R20] R6 ß R4-R5 R7 ß R1+R0 Register access in half cycles.
Recommend
More recommend