


  1. Slide 1: Credits
     • Some of the material in this presentation is taken from:
       – Computer Architecture: A Quantitative Approach
         • John Hennessy & David Patterson
     • Some of the material in this presentation is derived from course notes and slides from
       – Prof. Michel Dubois (USC)
       – Prof. Murali Annavaram (USC)
       – Prof. David Patterson (UC Berkeley)

     Slide 2: EE 457 Unit 9a
     Exploiting ILP: Out-of-Order Execution

     Slide 3: Outline
     • _________________ Parallelism
       – In-order (Io) pipeline
         • From academic 5-stage pipeline
         • To 8-stage MIPS R4000 pipeline
         • Superscalar, superpipelined
       – Out-of-Order (OoO) Execution
         • OoO Execution AND Out-of-order completion (Problem: Exceptions)
         • OoO Execution BUT In-order completion
     • ________________ Parallelism
       – Chip ______________ (CMT)
       – Chip ______________ (CMP)

     Slide 4: Other In-Order techniques
     SUPERSCALAR & SUPERPIPELINING

  2. Slide 5: Overview
     • Superscalar = More than 1 instruction ___________________________
       – ______uperscalar = Proc. that can issue 2 instructions per clock cycle
       – Success is sensitive to the ability to find independent instructions to issue in the same cycle
     • Superpipelining = Many small stages to boost _________________
       – Success depends on finding instructions to schedule in the shadow of data and control hazards
     [Figure: Superscalar: Executing more than 1 instruction per clock cycle (CPI < 1). Instructions 1 and 2 each pass through Instruc. Fetch, Instruc. Decode, Execute, Data Memory, Write back.]
     [Figure: Superpipelining: Divide logic into many short stages (______ Clock Frequency). Instructions 1 and 2 each pass through IF1 IF2 ID EX DM1 DM2 DM3 WB.]

     Slide 6: 2-way Superscalar
     • One ALU & Data transfer (LW/SW) instruction can be issued at the same time

     Instruction      Pipeline Stages
     ALU or branch    IF  ID  EX  MEM WB
     LW/SW            IF  ID  EX  MEM WB
     ALU or branch        IF  ID  EX  MEM WB
     LW/SW                IF  ID  EX  MEM WB
     ALU or branch            IF  ID  EX  MEM WB
     LW/SW                    IF  ID  EX  MEM WB

     [Figure: 2-way superscalar datapath. The PC and I-Cache supply 2 instructions; an Integer Slot (ALU) and an LD/ST Slot (Addr. Calc., D-Cache) share a Reg. File (_ Read, _ Write).]

     Slide 7: OUT-OF-ORDER EXECUTION

     Slide 8: Instruction Level Parallelism (ILP)
     • Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.
     • ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel
     • ________________________ is what truly limits ordering
     • _____________ instructions (no data dependencies) can be executed at the same time
     • _____________________ also provide some ordering constraints

     Program Order (In-order):
         lw  $s3,0($s4)
         and $t3,$t2,$t3
         add $t0,$t0,$s4
         or  $t5,$t3,$t2
         sub $t1,$t1,$t2
         beq $t0,$t8,L1
         xor $s0,$t1,$s2

     [Figure: Dependency Graph for the program above]

     We may perform execution out-of-order:
         Cycle 1:     /     /     /
         Cycle 2:     /     /     /
         Cycle 3:     /     /     /
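The cycle-by-cycle exercise on the ILP slide can be explored with a small script. This is a minimal sketch, not the course's method: it assumes every instruction takes one cycle, that any number of instructions may issue per cycle, that only read-after-write (RAW) dependencies matter for this sequence, and that no instruction after a branch runs before the cycle following the branch. The names `PROGRAM` and `schedule` are hypothetical.

```python
# Each entry: (text, destination register or None, source registers, is_branch).
# The instruction sequence is the one shown on the ILP slide.
PROGRAM = [
    ("lw  $s3,0($s4)",  "$s3", ["$s4"],        False),
    ("and $t3,$t2,$t3", "$t3", ["$t2", "$t3"], False),
    ("add $t0,$t0,$s4", "$t0", ["$t0", "$s4"], False),
    ("or  $t5,$t3,$t2", "$t5", ["$t3", "$t2"], False),
    ("sub $t1,$t1,$t2", "$t1", ["$t1", "$t2"], False),
    ("beq $t0,$t8,L1",  None,  ["$t0", "$t8"], True),
    ("xor $s0,$t1,$s2", "$s0", ["$t1", "$s2"], False),
]


def schedule(program):
    """Assign each instruction the earliest cycle allowed by its dependencies.

    Assumptions (not from the slides): unit latency, unlimited issue width,
    RAW data dependencies only, and a control dependency on the most recent
    earlier branch.
    """
    last_writer = {}    # register -> index of the most recent producer
    cycle_of = []       # cycle assigned to each instruction, in program order
    last_branch = None  # index of the most recent branch, if any
    for i, (_, dest, srcs, is_branch) in enumerate(program):
        earliest = 1
        for reg in srcs:
            if reg in last_writer:  # RAW: wait for the producer to finish
                earliest = max(earliest, cycle_of[last_writer[reg]] + 1)
        if last_branch is not None:  # control: wait for the branch outcome
            earliest = max(earliest, cycle_of[last_branch] + 1)
        cycle_of.append(earliest)
        if dest is not None:
            last_writer[dest] = i
        if is_branch:
            last_branch = i
    return cycle_of


if __name__ == "__main__":
    for (text, *_), cycle in zip(PROGRAM, schedule(PROGRAM)):
        print(f"cycle {cycle}: {text}")
```

Under these assumptions the seven instructions fit in three cycles, which matches the three-cycle grid on the slide. A real machine would also be limited by issue width and by the number of functional units.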

  3. Slide 9: Basic Blocks
     • Basic Block (def.) = Sequence of instructions that will always be ________________
       – No __________________ out
       – No branch targets coming ____
       – Also called "straight-line" code
       – Average size: _____ instrucs.

             lw  $s3,0($s4)
             and $t3,$t2,$t3
         L1: add $t0,$t0,$s4      <- This is a basic block
             or  $t5,$t3,$t2         (starts w/ target,
             sub $t1,$t1,$t2          ends with branch)
             beq $t0,$t8,L1
             xor $s0,$t1,$s2

     • Instructions in a basic block can be overlapped if there are no data dependencies
     • ________ dependences really ________________ of possible instructions to overlap
       – W/o extra hardware, we can only overlap execution of instructions within a basic block

     Slide 10: Out-of-Order Motivation
     • Hide the impact of dynamic events such as a ______________
     • Out-of-Order (OoO) Execution
       – Let ________________ instructions behind a stalled instruction execute
       – Important aspect: Completion Ordering
         • Out-of-Order completion: Let the independent instruction that has been executed ____________________________________ before the stalled instruction completes
           – Problem: Exception handling
         • In-Order completion: Let the independent instructions execute but ______________ their results until the stalled instruction completes

     Slide 11: Out-of-Order Execution
     • "Execution" here means ____________ the results, not necessarily _____________ them to a register or memory
     • Completion means _____________________ the results to the register file or memory
     • While we say out-of-order execution we really mean:
       – In-order Issue/Dispatch (IoD)
       – Out-of-Order Execution (OoOE)
       – In-order Completion (IoC)
     [Figure: Issue/Dispatch (In-order) => Execution (Out-of-Order) => Completion (In-order)]

     Slide 12: In- or Out-of-Order Completion
     • IoI/IoD => OoOE => IoC
       – In-order completion is necessary to support precise exceptions [exact state at time of exception]
     • We will present the concept of OoOC (out-of-order completion), which is a bit easier, and then come back to the desired approach of IoC
     • OoOC Issues
       – _____________…we should not commit an instruction that came after (in program order) a branch
       – Solution: Stall dispatching instructions after a branch until we resolve the outcome

         LW  $4,0($5)     // cache miss
         BEQ $4,$0,L1
         ADD $6,$7,$8     // What if we execute this ADD out of order
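The basic-block definition on the Basic Blocks slide can be turned into a short partitioning routine. This is a minimal sketch using the common "leaders" rule (the first instruction, every branch target, and every instruction that follows a branch start a new block); the tuple layout and the name `basic_blocks` are assumptions made for illustration.

```python
def basic_blocks(program):
    """Split a program into basic blocks.

    program: list of (label, text, branch_target) tuples, where label and
    branch_target are strings or None. Returns a list of blocks, each a list
    of instruction indices.
    """
    targets = {tgt for _, _, tgt in program if tgt is not None}
    leaders = set()
    for i, (label, _, tgt) in enumerate(program):
        if i == 0 or label in targets:
            leaders.add(i)              # first instruction or a branch target
        if tgt is not None and i + 1 < len(program):
            leaders.add(i + 1)          # instruction right after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


# The example from the Basic Blocks slide.
EXAMPLE = [
    (None, "lw  $s3,0($s4)",  None),
    (None, "and $t3,$t2,$t3", None),
    ("L1", "add $t0,$t0,$s4", None),
    (None, "or  $t5,$t3,$t2", None),
    (None, "sub $t1,$t1,$t2", None),
    (None, "beq $t0,$t8,L1",  "L1"),
    (None, "xor $s0,$t1,$s2", None),
]

if __name__ == "__main__":
    for block in basic_blocks(EXAMPLE):
        print([EXAMPLE[i][1] for i in block])
```

For the slide's example the middle block runs from `L1: add` through `beq`, that is, it starts with a branch target and ends with a branch.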

  4. Slide 13: Scheduling Strategies
     • _____________ Scheduling
       – ___________ re-orders instructions in such a way that no dependencies will be violated and allows for OoOE
     • ____________ Scheduling
       – ______ implementing the Tomasulo algorithm or another similar approach will re-order instructions to allow for OoOE
     • More Advanced Concepts
       – Branch prediction and speculative execution (execution beyond a branch, flushing if incorrect) will be covered later

     Slide 14: Static Scheduling
     • Strengths
       – Hardware simplicity [Better clock rate]
         • Power/energy advantage
         • Compiler has a global view of the program anyway, so it should be able to do a "good" job
       – Very predictable: static performance predictions are reliable
     • Weaknesses
       – Requires _______________ to take advantage of new/modified architecture
       – Cannot foresee dynamic (data-dependent) events
         • Cache miss, conditional branches (can only reschedule instructions in a basic block)
       – Cannot precompute memory addresses
       – No good solution for precise exceptions with out-of-order completion

     Slide 15: Where to Stall?
     • Simple 5-stage pipeline:
       – Dependent instructions cannot be stalled in the EX stage
       – Why? What if ADD was also dependent on the instruction in ____… ADD has no place to ________ that forwarded value
     • Dependent instructions stalled in the ID stage
       – Thus we stall in ID so we can use the ______________ to grab dependent values. Further, stalling in ID incurs only a 1-cycle penalty, as would stalling in EX.
     [Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) showing LW $4 followed by ADD $1,$3,$4, which stalls]

     Slide 16: Where to Stall?
     • In 5-stage pipeline (in-order execution) RAW dependency was solved by
       – ______________________
       – ______________ Stall
     [Figure: 5-stage pipelined datapath with PC, I-Cache, Instruction Register, Register File, Sign Extend, ALU, D-Cache and pipeline stage registers; the HDU drives PCWrite, IRWrite and IF.Flush, and the Forwarding Unit drives ALUSelA and ALUSelB from the prior ALU result and the data memory or ALU result]
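The stall-in-ID decision from the "Where to Stall?" slides can be sketched as a load-use check. This is a minimal sketch assuming the usual textbook condition (a load in EX whose destination register matches a source register of the instruction in ID); it is not taken from the slides' datapath. `PCWrite` and `IRWrite` are signal names that appear in the datapath figure, while the function name, its parameters and the `insert_bubble` field are hypothetical.

```python
def hazard_detection_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must stall for one cycle.

    id_ex_mem_read: True if the instruction currently in EX is a load.
    id_ex_rt:       destination register number of that load.
    if_id_rs/rt:    source register numbers of the instruction in ID.
    Register 0 is treated as never creating a dependency (assumption).
    """
    load_use = (
        bool(id_ex_mem_read)
        and id_ex_rt != 0
        and id_ex_rt in (if_id_rs, if_id_rt)
    )
    return {
        "PCWrite": not load_use,    # hold the PC while stalling
        "IRWrite": not load_use,    # hold the fetched instruction in place
        "insert_bubble": load_use,  # send a no-op down the pipeline instead
    }


if __name__ == "__main__":
    # LW $4,... is in EX while ADD $1,$3,$4 is in ID: the ADD must wait.
    print(hazard_detection_unit(True, 4, 3, 4))
    # An ALU instruction in EX can forward its result, so no stall is needed.
    print(hazard_detection_unit(False, 4, 3, 4))
```

Holding the dependent instruction in ID keeps it next to the register file, which is the point the slide makes about where the stalled instruction can still pick up its operands.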

  5. Slide 17: Where to Stall?
     • But to implement OoO execution, we ________________ in the decode stage since that would prevent any further issuing of instructions
     • Thus, now we will issue to queues for each of the multiple functional units and have the instruction stall in the queue until it is ready
     [Figure: datapath with Queues in front of the Functional Units (ALU, MUL, DIV, Addr Calc.); the decode stage is marked "Stalling here would _______ up the pipeline"]

     Slide 18: Forwarding in OoO Execution
     • In the 5-stage pipeline, later instructions carried their source register IDs into the EX stage to be compared with the _____________ register IDs of their _______ instructions
     • But in OoO execution, we may have ______ (earlier) instructions in front of us and cannot afford to perform so many comparisons (as well as handling the case where many earlier instructions are producing new versions of a register)
     • Instead, the dispatch unit will explicitly tell the dependent instruction who to ____________ data from
     [Figure: 5-stage datapath with Forwarding Unit, repeated from the earlier "Where to Stall?" slide]

     Slide 19: Tomasulo's Plan
     • OoO Execution
     • Multiple functional units
       – Integer ALU, Data memory, Multiplier, Divider
     • Queues between ID and EX stages (in place of ID/EX register)
       – Allows later instructions to keep issuing even if earlier ones are stalled

     Slide 20: OoO Execution Problems
     • For the time being
       – No branch prediction
       – No speculative execution beyond a branch
     • So we simply stall on a conditional branch
     • For the time being, no support for precise exceptions
       – Even then what about hazards…
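The idea of per-unit queues, with the dispatch unit telling each dependent instruction which producer to wait for, can be illustrated with a toy model. This is a minimal sketch loosely in the spirit of tag-based dispatch, not the Tomasulo algorithm itself: it ignores branches, memory ordering and register renaming details. The class `Dispatcher`, its methods, the unit names and the three-instruction example are all hypothetical.

```python
from collections import deque


class Dispatcher:
    """Toy model: in-order dispatch into per-functional-unit queues."""

    def __init__(self, units=("ALU", "MEM", "MUL", "DIV")):
        self.queues = {u: deque() for u in units}
        self.producer = {}  # register -> tag of the pending instruction writing it
        self.next_tag = 0

    def dispatch(self, unit, dest, srcs):
        """Queue an instruction; each source records its producer's tag,
        or None when the register file already holds the current value."""
        tag = self.next_tag
        self.next_tag += 1
        waiting_on = {reg: self.producer.get(reg) for reg in srcs}
        self.queues[unit].append({"tag": tag, "dest": dest, "waiting_on": waiting_on})
        if dest is not None:
            self.producer[dest] = tag
        return tag

    def broadcast(self, tag):
        """Instruction `tag` finished: remove it and wake up its waiters."""
        for unit, queue in self.queues.items():
            self.queues[unit] = deque(e for e in queue if e["tag"] != tag)
        for queue in self.queues.values():
            for entry in queue:
                for reg, t in entry["waiting_on"].items():
                    if t == tag:
                        entry["waiting_on"][reg] = None
        for reg in [r for r, t in self.producer.items() if t == tag]:
            del self.producer[reg]

    def ready(self, unit):
        """Tags of queued instructions whose operands are all available."""
        return [e["tag"] for e in self.queues[unit]
                if all(t is None for t in e["waiting_on"].values())]


if __name__ == "__main__":
    d = Dispatcher()
    lw = d.dispatch("MEM", "$4", ["$5"])         # LW  $4,0($5)  (say it misses)
    add = d.dispatch("ALU", "$6", ["$4", "$8"])  # ADD $6,$4,$8  waits on the LW
    sub = d.dispatch("ALU", "$9", ["$7", "$8"])  # SUB $9,$7,$8  independent
    print(d.ready("ALU"))  # only the SUB can execute while the LW is pending
    d.broadcast(lw)
    print(d.ready("ALU"))  # now the ADD is ready as well
```

Because the waiting instruction sits in a queue and not in the decode stage, the later independent instruction can still be dispatched and executed, which is the behavior the slides motivate.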
