CSCI341 Lecture 37, Introduction to Parallelism
PIPELINING “Exploits potential parallelism among instructions.” “Instruction-level parallelism”
INSTRUCTION-LEVEL PARALLELISM • Increase depth of pipeline (greater overlap of instructions) • Replicate hardware (handle more instructions simultaneously) • aka “multiple issue”
MULTIPLE ISSUE • Instruction execution rate can exceed the clock rate (more than one instruction completes per cycle) • CPI less than 1
EXAMPLE A 4 GHz four-way multiple issue microprocessor... • Peak rate of 16 billion instructions per second • Ideal CPI of 0.25 (IPC of 4) • In a five-stage pipeline, 20 instructions in progress at once (modern CPUs approach 3 to 6 instructions per cycle)
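The arithmetic on this slide can be checked with a short Python sketch. The function name peak_rates and its parameters are illustrative, not from the lecture, and the figures assume the ideal case in which every issue slot is filled and nothing stalls.

```python
def peak_rates(clock_hz, issue_width, pipeline_stages):
    """Ideal-case figures for a multiple-issue pipeline (assumes no stalls)."""
    # Each cycle can start issue_width instructions.
    instructions_per_second = clock_hz * issue_width
    # CPI is the reciprocal of instructions per cycle (IPC).
    ideal_cpi = 1 / issue_width
    # Every stage holds issue_width instructions when the pipeline is full.
    in_flight = issue_width * pipeline_stages
    return instructions_per_second, ideal_cpi, in_flight


# The slide's example: 4 GHz, four-way issue, five-stage pipeline.
ips, cpi, in_flight = peak_rates(4_000_000_000, 4, 5)
print(ips, cpi, in_flight)
```

These are upper bounds. Real programs fall short of them whenever hazards leave issue slots empty.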
2 IMPORTANT IMPLEMENTATIONS • Compile-time (statically) • During execution (dynamically)
CHALLENGES • How does the CPU determine how many instructions (and which instructions) can be issued? • How do we deal with data/control hazards?
SPECULATION • The compiler / CPU “guesses” about the properties of an instruction. • e.g., branching, storing & loading • Potential for bad guesses (undoing a wrong guess is complex) • Buffering speculated instructions • Buffering exceptions
STATIC MULTIPLE ISSUE SYSTEM Heavy reliance on the compiler.
ISSUE PACKET • Set of instructions issued in a given clock cycle • Very Long Instruction Word (VLIW)
CONSIDER... A two-issue MIPS processor (let’s call it “TIM”). • One instruction can be an ALU operation or a branch • The other can be a load/store
TIM • How many bits of instructions per cycle? • Instructions paired, aligned. • ALU/branch instruction is “first.” • If one member of the pair can’t be used, replace with nop.
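The pairing rules above can be sketched in Python. This is a minimal illustration, not the lecture's method: the names form_packets, is_memory_op, and MEMORY_OPS are hypothetical, only lw and sw are treated as load/store, and the sketch deliberately ignores data and control hazards. It shows only the packet format (ALU/branch first, load/store second, nop in any unused slot).

```python
NOP = "nop"
MEMORY_OPS = {"lw", "sw"}  # assumption: the only load/store opcodes in this sketch


def is_memory_op(instruction):
    """True if the instruction's opcode is a load or store."""
    return instruction.split()[0] in MEMORY_OPS


def form_packets(instructions):
    """Pair instructions in program order into (ALU/branch, load/store) packets.

    Hazards are NOT checked; a real compiler must also verify dependences.
    """
    packets = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        if is_memory_op(current):
            # No ALU/branch instruction is available for the first slot.
            packets.append((NOP, current))
            i += 1
        elif i + 1 < len(instructions) and is_memory_op(instructions[i + 1]):
            # An ALU/branch instruction followed by a load/store fills both slots.
            packets.append((current, instructions[i + 1]))
            i += 2
        else:
            # No load/store is available for the second slot.
            packets.append((current, NOP))
            i += 1
    return packets


code = ["addu $t0, $t0, $s2", "lw $t1, 0($s1)", "addi $s1, $s1, -4"]
for packet in form_packets(code):
    print(packet)
```

Each packet is two 32-bit instructions, so TIM fetches 64 bits of instructions per cycle.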
TIM Two instructions per stage at a time.
TIM HAZARDS • Sometimes, it’s the compiler’s full responsibility • Remove hazards by arranging/scheduling instructions • Inserting NOPs where necessary, etc
TIM HAZARDS • Sometimes, the hardware detects hazards between issue packets • Generates stalls • Still relies on compiler to generate appropriate packets
TIM’S DATAPATH
TIM’S DATAPATH 32 more bits from instruction memory Two more read ‘ports’ one more write ‘port’ Extra ALU
NO MAGIC SPEED BOOST Potential to double performance. Potential for hazards to impact two instructions.
USE LATENCY (that’s a noun) “Number of clock cycles between a load instruction and an instruction that can use the result of the load without stalling the pipeline.”
MIPS: use latency of one cycle. TIM: a one-cycle use latency potentially impacts two instructions.
TIM really needs to rely on the compiler.
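To make use latency concrete, here is a rough Python sketch that counts load-use stall cycles in a single-issue, in-order pipeline. The function name count_load_use_stalls and the (opcode, destination, sources) tuple format are assumptions made for illustration, and results of non-load instructions are assumed to be forwarded with no stall.

```python
def count_load_use_stalls(program, use_latency=1):
    """Count stall cycles caused by loads in a single-issue, in-order pipeline.

    program: list of (opcode, destination_register_or_None, [source_registers]).
    """
    ready_at = {}  # register -> earliest cycle a consumer can issue without stalling
    cycle = 0
    stalls = 0
    for opcode, dest, sources in program:
        cycle += 1  # next issue slot
        earliest = max((ready_at.get(reg, 0) for reg in sources), default=0)
        if earliest > cycle:
            stalls += earliest - cycle  # wait for the loaded value
            cycle = earliest
        if opcode == "lw":
            # use_latency cycles must separate the load from its first consumer.
            ready_at[dest] = cycle + 1 + use_latency
        elif dest is not None:
            # Assumption: non-load results are forwarded, so no stall is needed.
            ready_at.pop(dest, None)
    return stalls


back_to_back = [
    ("lw", "$t0", ["$s1"]),
    ("addu", "$t0", ["$t0", "$s2"]),  # uses $t0 right after the load
]
separated = [
    ("lw", "$t0", ["$s1"]),
    ("addi", "$s1", ["$s1"]),         # independent work fills the gap
    ("addu", "$t0", ["$t0", "$s2"]),
]
print(count_load_use_stalls(back_to_back), count_load_use_stalls(separated))
```

On TIM each stall cycle leaves two issue slots idle rather than one, which is why the compiler's scheduling matters more there.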
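EXAMPLE
Loop: lw   $t0, 0($s1)        # $t0 = array element
      addu $t0, $t0, $s2      # add the scalar in $s2
      sw   $t0, 0($s1)        # store the result
      addi $s1, $s1, -4       # decrement the pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0
How might we schedule this for TIM?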
EXAMPLE Scheduled for TIM:
          ALU/branch ins.         Data xfer ins.     clock cycle
Loop:                             lw $t0, 0($s1)     1
          addi $s1, $s1, -4                          2
          addu $t0, $t0, $s2                         3
          bne $s1, $zero, Loop    sw $t0, 4($s1)     4
(The sw offset becomes 4 because addi has already decremented $s1 by 4.)
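How well does this schedule use TIM's two issue slots? The following Python sketch counts the filled slots in the four-cycle schedule on this slide. The schedule list transcribes the slide; the helper name schedule_stats is illustrative.

```python
# One tuple per clock cycle: (ALU/branch slot, data transfer slot); None = empty slot.
schedule = [
    (None, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", None),
    ("addu $t0, $t0, $s2", None),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]


def schedule_stats(packets):
    """Return (instructions issued, cycles, IPC, CPI) for a list of issue packets."""
    cycles = len(packets)
    issued = sum(slot is not None for packet in packets for slot in packet)
    return issued, cycles, issued / cycles, cycles / issued


issued, cycles, ipc, cpi = schedule_stats(schedule)
print(issued, cycles, ipc, cpi)  # 5 4 1.25 0.8
```

Five instructions in four cycles gives an IPC of 1.25 against a peak of 2, so three of the eight issue slots go unused.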
LOOP UNROLLING (compiler technique) For loops that access arrays, make multiple copies of the loop body. Schedule instructions from different iterations together.
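The shape of the transformation can be sketched in Python, using a loop that adds a scalar to every array element (the same job as the MIPS loop in the earlier example). The function names and the unroll factor of 4 are assumptions for illustration. A Python interpreter gains nothing from this; the point is only to show what a compiler produces before it schedules the copies together.

```python
def add_scalar_rolled(values, scalar):
    """Original loop: one element per iteration."""
    for i in range(len(values)):
        values[i] += scalar


def add_scalar_unrolled4(values, scalar):
    """Same loop unrolled by 4: four copies of the body per iteration."""
    n = len(values)
    i = 0
    # Main unrolled loop: the four statements are independent of one another,
    # so a scheduler is free to interleave their loads, adds, and stores.
    while i + 4 <= n:
        values[i] += scalar
        values[i + 1] += scalar
        values[i + 2] += scalar
        values[i + 3] += scalar
        i += 4
    # Cleanup loop for the leftover elements when n is not a multiple of 4.
    while i < n:
        values[i] += scalar
        i += 1


a = list(range(10))
b = list(range(10))
add_scalar_rolled(a, 3)
add_scalar_unrolled4(b, 3)
print(a == b)  # True
```

Unrolling also amortizes the loop overhead: the pointer update and branch run once per four elements instead of once per element.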
LOOP UNROLLING Challenge: how does this work? (p396 - 398)
HOMEWORK • Reading 31 • Continue Project 8 TIMmeh!