forwarding idea
read wrong value (e.g. from register)
correct value is already computed elsewhere in pipeline
    maybe even after old value was read
substitute correct value for wrong value using MUX
quiz question: forwarding in IRMOVQ
cycle #              0 1 2 3 4 5 6 7 8
irmovq $50, %r8      F D E M W
addq %r11, %r8         F D E M W
pipeline register values labeled on the diagram:
    output of decode/execute regs (irmovq) (unchanged during execute stage)
    input of execute/memory regs (irmovq)
    input of decode/execute regs (addq)
forwarding logic
[figure: datapath split by the fetch/decode, decode/execute, and execute/writeback pipeline registers; labels: PC, "add 2", Instr. Mem., register file (srcA, srcB, R[srcA], R[srcB], dstE, dstM, next R[dstE], next R[dstM]), 0xF, two ADD units, two MUXes]
some forwarding paths
cycle #                 0 1 2 3 4 5 6 7 8
addq %r8, %r9           F D E M W
subq %r9, %r11            F D E M W
mrmovq 4(%r11), %r10        F D E M W
rmmovq %r9, 8(%r11)           F D E M W
xorq %r10, %r9                  F D E M W
forwarding in HCL
    register dE {
        valA : 64 = 0;
        dstE : 4 = 0;
    };
    ...
    /* was: d_valA = reg_outputA; */
    d_valA = [
        reg_srcA == e_dstE : e_valE;
        ...
        1 : reg_outputA;
    ];
    d_dstE = ...;
unsolved problem
cycle #                  0 1 2 3 4 5 6 7 8
mrmovq 0(%rax), %rbx     F D E M W
subq %rbx, %rcx            F D E M W      (without stall)
subq %rbx, %rcx            F F D E M W    (stall)
without the stall, subq would execute in cycle 3, the same cycle in which mrmovq is still reading %rbx's new value from memory
multiple forwarding paths
cycle #              0 1 2 3 4 5 6 7 8
addq %r10, %r8       F D E M W
addq %r11, %r8         F D E M W
addq %r12, %r8           F D E M W
multiple forwarding HCL
    d_valA = [
        ...
        reg_srcA == e_dstE : e_valE;
        reg_srcA == m_dstE : m_valE;
        ...
        1 : reg_outputA;
    ];
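The ordered cases above can be read as a priority list: the first matching case wins, so the result still in execute (the newest one) is preferred over the one in memory. A minimal Python sketch of that selection follows, written for a generic source register rather than only valA. The function name, the `REG_NONE = 0xF` encoding, and the explicit "no register" check are assumptions for illustration; they stand in for the cases elided by "..." on the slide and are not taken from it.

```python
REG_NONE = 0xF  # assumed encoding for "no register"


def forward_operand(reg_src, reg_output, e_dst_e, e_val_e, m_dst_e, m_val_e):
    """Choose the operand value for decode, checking cases in priority order."""
    if reg_src == REG_NONE:
        # Assumption: an instruction that reads no register never forwards.
        return reg_output
    if reg_src == e_dst_e:
        return e_val_e   # newest in-flight result (producer is in execute)
    if reg_src == m_dst_e:
        return m_val_e   # older in-flight result (producer is in memory)
    return reg_output    # no in-flight writer: register file value is current


R8 = 0x8
# Third addq of "addq %r10,%r8 / addq %r11,%r8 / addq %r12,%r8" is in decode:
# both earlier addqs are writing %r8, one from execute and one from memory.
chosen = forward_operand(R8, reg_output=1, e_dst_e=R8, e_val_e=30,
                         m_dst_e=R8, m_val_e=20)
```

Swapping the two forwarding checks would hand the third addq the stale sum from the first addq, which is why the order of the cases matters.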
multiple forwarding paths (2)
cycle #              0 1 2 3 4 5 6 7 8
addq %r10, %r8       F D E M W
addq %r11, %r12        F D E M W
addq %r12, %r8           F D E M W
after forwarding/prediction
where do we still need to stall?
    memory output needed in fetch
        ret followed by anything
    memory output needed in execute
        mrmovq or popq + use (in immediately following instruction)
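The load/use case above can be checked mechanically by looking at adjacent instruction pairs. The following is a rough sketch only: the `Instr` record and its field names are hypothetical, and `dst` here means the register filled from memory (for popq, the popped register, not %rsp).

```python
from collections import namedtuple

# Hypothetical instruction record, for illustration only.
# srcs: registers read; dst: register loaded from memory (or None).
Instr = namedtuple("Instr", ["op", "srcs", "dst"])

LOADS = {"mrmovq", "popq"}  # results that only exist after the memory stage


def needs_load_use_stall(prev, cur):
    """True if cur immediately follows a load and reads the loaded register."""
    return prev.op in LOADS and prev.dst is not None and prev.dst in cur.srcs


load = Instr("mrmovq", srcs=["%rax"], dst="%rbx")   # mrmovq 0(%rax), %rbx
use = Instr("subq", srcs=["%rbx", "%rcx"], dst=None)  # subq %rbx, %rcx
other = Instr("subq", srcs=["%rdx", "%rcx"], dst=None)  # subq %rdx, %rcx
```

An instruction that reads the loaded register two or more slots later is not flagged, since by then the value can be forwarded.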
overall CPU
5 stage pipeline
1 instruction completes every cycle, except for hazards
most data hazards: solved by forwarding
load/use hazard: 1 cycle of stalling
jXX control hazard: branch prediction + squashing
    2 cycle penalty for misprediction
ret control hazard: 3 cycles of stalling
pipelined control costs
how much faster than single-cycle processor?
at most five times faster
depends on hardware details
    does added logic make clock cycle slower?
depends on what programs we run:
    how many mispredicted jumps?
    how many rets?
    how many load/use hazards?
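One way to make these questions concrete is a back-of-the-envelope estimate using the penalties stated for this pipeline (1 cycle per load/use hazard, 2 per mispredicted jump, 3 per ret). The sketch below is only that: the function names are made up, and the instruction-mix fractions and cycle times in the usage lines are hypothetical inputs, not measurements.

```python
def pipelined_cpi(load_use_frac, mispredict_frac, ret_frac):
    """Average cycles per instruction: 1 plus the expected stall/squash cycles.

    Each argument is the fraction of executed instructions that incurs
    that penalty.
    """
    return 1.0 + 1 * load_use_frac + 2 * mispredict_frac + 3 * ret_frac


def speedup(single_cycle_time, pipelined_cycle_time, cpi):
    """Time per instruction on the single-cycle design over the pipelined one."""
    return single_cycle_time / (pipelined_cycle_time * cpi)


# Hypothetical mix: 10% load/use, 5% mispredicted jumps, 2% rets.
cpi = pipelined_cpi(0.10, 0.05, 0.02)
# Hypothetical best case for the clock: the cycle shrinks by exactly 5x.
estimate = speedup(single_cycle_time=5.0, pipelined_cycle_time=1.0, cpi=cpi)
```

With no hazards and a perfectly divided clock period the estimate reaches the five-times bound; any hazard or any extra logic on the critical path pulls it below that.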
hazards versus dependencies
dependency: X needs result of instruction Y?
hazard: will it not work in some pipeline?
    before extra work is done to “resolve” hazards
    like forwarding or stalling or branch prediction
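Dependencies are a property of the program alone, so they can be listed without knowing anything about the pipeline. A minimal sketch, assuming each instruction is described by a hypothetical `(text, sources, destination)` tuple and that only the most recent writer of a register counts:

```python
def find_dependencies(program):
    """Return (writer_index, reader_index, register) for each read of a
    register whose latest earlier writer is in the program."""
    last_writer = {}
    deps = []
    for i, (_text, srcs, dst) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dst is not None:
            last_writer[dst] = i  # record after reading: addq reads, then writes
    return deps


# addq rA, rB reads both registers and writes rB.
program = [
    ("addq %r10, %r8", ["%r10", "%r8"], "%r8"),
    ("addq %r11, %r12", ["%r11", "%r12"], "%r12"),
    ("addq %r12, %r8", ["%r12", "%r8"], "%r8"),
]
deps = find_dependencies(program)
```

Whether any of these dependencies is also a hazard depends on how far apart the two instructions are and on the pipeline they run in.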
ex.: dependencies and hazards (1)
    addq %rax, %rbx
    subq %rax, %rcx
    irmovq $100, %rcx
    addq %rcx, %r10
    addq %rbx, %r10
where are dependencies?
which are hazards in our pipeline?
which are resolved with forwarding?
ex.: dependencies and hazards (2)
    mrmovq 0(%rax), %rbx
    addq %rcx, %rbx
    jne foo
    foo: addq %rcx, %rdx
    mrmovq (%rdx), %rcx
where are dependencies?
which are hazards in our pipeline?
which are resolved with forwarding?
pipeline with different hazards
example: 4-stage pipeline: fetch/decode/execute+memory/writeback
stage each instruction occupies while andq is in decode:
                       // 4 stage   // 5 stage
    addq %rax, %r8     //           // W
    subq %rax, %r9     // W         // M
    xorq %rax, %r10    // EM        // E
    andq %r8, %r11     // D         // D
addq/andq is not a hazard with 4-stage pipeline
addq/andq is hazard with 5-stage pipeline
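The same comparison can be written as a small check on stage positions. This is a sketch under stated assumptions: no forwarding and no stalling, one instruction issued per cycle, and a register write that only becomes visible to reads in the cycle after writeback. The function and parameter names are illustrative.

```python
def is_hazard(distance, decode_index, writeback_index):
    """distance: how many instructions after the writer the reader is issued
    (1 = immediately following). Stage indexes count from fetch = 0."""
    writer_writeback_cycle = writeback_index        # writer issued at cycle 0
    reader_decode_cycle = distance + decode_index   # reader issued `distance` later
    # The read is too early if it happens no later than the write.
    return reader_decode_cycle <= writer_writeback_cycle


# andq %r8, %r11 is issued 3 instructions after addq %rax, %r8.
five_stage = is_hazard(distance=3, decode_index=1, writeback_index=4)  # F D E M W
four_stage = is_hazard(distance=3, decode_index=1, writeback_index=3)  # F D EM W
```

The dependency between addq and andq is identical in both cases; only the pipeline changes whether it is a hazard.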
exercise: different pipeline
split execute into two stages: F/D/E1/E2/M/W
cycle #            0  1  2  3  4  5  6  7  8
addq %rcx, %r9     F  D  E1 E2 M  W
without stalling:
addq %r9, %rbx        F  D  E1 E2 M  W
addq %rax, %r9           F  D  E1 E2 M  W
with stalling:
addq %r9, %rbx        F  D  D  E1 E2 M  W
addq %rax, %r9           F  F  D  E1 E2 M  W
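The number of stall cycles in a pipeline like this one can be estimated from where a value is produced and where it is first needed. The sketch below assumes full forwarding, that a value produced during one stage can be used by a consumer stage in the following cycle, and one instruction issued per cycle. The function name and arguments are illustrative, not part of the exercise.

```python
def stalls_needed(stages, produce_stage, consume_stage, distance):
    """Stall cycles for a reader issued `distance` instructions after the writer.

    produce_stage: stage during which the writer's result is computed
    consume_stage: first stage in which the reader uses the operand
    """
    p = stages.index(produce_stage)
    c = stages.index(consume_stage)
    # Writer finishes producing at cycle p; reader would reach its consuming
    # stage at cycle distance + c and needs that to be at least p + 1.
    return max(0, p + 1 - c - distance)


six_stage = ["F", "D", "E1", "E2", "M", "W"]
five_stage = ["F", "D", "E", "M", "W"]

back_to_back_split = stalls_needed(six_stage, "E2", "E1", distance=1)
back_to_back_plain = stalls_needed(five_stage, "E", "E", distance=1)
load_use = stalls_needed(five_stage, "M", "E", distance=1)
```

Under these assumptions the split-execute pipeline needs one stall for back-to-back dependent addq instructions, while the five-stage pipeline needs none for the same pair and one for a load followed immediately by a use.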
exercise: forwarding paths
cycle #                 0 1 2 3 4 5 6 7 8
addq %rcx, %r9          F D E M W
rmmovq %r9, 8(%r8)        F D E M W
mrmovq 8(%r9), %r11         F D E M W
pushq %r11                    F D E M W
popq %r10                       F D E M W
exercise: forwarding paths (alt pipe)
suppose four-stage pipeline: fetch/decode+execute/memory/writeback
cycle #                 0  1  2  3  4  5  6  7  8
addq %rcx, %r9          F  DE M  W
rmmovq %r9, 8(%r8)         F  DE M  W
mrmovq 8(%r9), %r11           F  DE M  W
pushq %r11                       F  DE M  W
popq %r10                           F  DE M  W
pipelined control costs
how much faster than single-cycle processor?
at most five times faster
depends on HW details:
    how expensive is forwarding logic? (new MUXes on critical path)
    how well balanced are the stages?
depends on what programs we run:
    how many mispredicted jumps?
    how many rets?
    how many load/use hazards?
HCL2D pipeline registers
    register xF {
        pc : 64 = 0;
    };
    /* Fetch+PC Update */
    register fD {
        rA : 4 = REG_NONE;
        rB : 4 = REG_NONE;
    };
    /* Decode */
    register dE {
        valA : 64 = 0;
        valB : 64 = 0;
        dstE : 4 = REG_NONE;
    }
    /* Execute */
    register eW {
        valE : 64 = 0;
        dstE : 4 = REG_NONE;
    }
    /* Writeback */
HCL2D: Fetch/Decode
unpipelined
    /* Fetch+PC Update */
    pc = F_pc;
    x_pc = pc + 2;
    rA = i10bytes[12..16];
    rB = i10bytes[8..12];
    /* Decode */
    reg_srcA = rA;
    reg_srcB = rB;
    dstE = rB;
    valA = reg_outputA;
    valB = reg_outputB;
pipelined
    /* Fetch+PC Update */
    pc = F_pc;
    x_pc = pc + 2;
    f_rA = i10bytes[12..16];
    f_rB = i10bytes[8..12];
    /* Decode */
    reg_srcA = D_rA;
    reg_srcB = D_rB;
    d_dstE = D_rB;
    d_valA = reg_outputA;
    d_valB = reg_outputB;
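The f_/D_ naming in the pipelined version reflects that a pipeline register has an input side, written by the earlier stage, and an output side, read by the later stage one cycle afterwards. A rough Python model of just the fD register follows. The class, the function names, and `REG_NONE = 0xF` are assumptions for illustration, and the bit slice `[12..16]` is taken to mean bits 12 through 15.

```python
REG_NONE = 0xF  # assumed encoding for "no register"


class PipelineRegister:
    """Fields with separate input (next) and output (current) values."""

    def __init__(self, **defaults):
        self.inp = dict(defaults)  # written by the earlier stage (e.g. f_rA)
        self.out = dict(defaults)  # read by the later stage (e.g. D_rA)

    def clock(self):
        # Rising clock edge: outputs take on whatever the inputs were.
        self.out = dict(self.inp)


def fetch(fD, instr_bits):
    # Mirrors f_rA = i10bytes[12..16]; f_rB = i10bytes[8..12];
    fD.inp["rA"] = (instr_bits >> 12) & 0xF
    fD.inp["rB"] = (instr_bits >> 8) & 0xF


def decode(fD):
    # Mirrors reg_srcA = D_rA; reg_srcB = D_rB;
    return fD.out["rA"], fD.out["rB"]


fD = PipelineRegister(rA=REG_NONE, rB=REG_NONE)
fetch(fD, 0xA860)      # instruction bytes 0x60 0xA8: addq %r10, %r8
before = decode(fD)    # decode still sees the reset values this cycle
fD.clock()
after = decode(fD)     # one cycle later decode sees rA = 0xA, rB = 0x8
```

In the unpipelined version the same wires feed decode directly, so fetch and decode of one instruction happen in the same cycle.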