CS 152: Discussion Section 6
Out-of-Order Execution
Albert Ou, Yue Dai
03/06/2020

Administrivia
● Lab 2 due 10:30am on Mon, March 9
● Problem Set 3 due 10:30am on Mon, March 16
● Midterm 1 scores will be available on Gradescope on Wed, March 11
  ○ One week to submit regrade requests
  ○ Note: Regrades invite further scrutiny (score might increase or decrease)

Post-Midterm Poll
● What topics should we cover in future discussions?
● Better scheduling of office hours?

Agenda
● Precise exceptions review
  ○ Underappreciated (yet vital) component of HW/SW contract
  ○ Recurring concept in OoO context
● Register renaming
  ○ Cannot understand OoO without understanding this
● Tomasulo’s algorithm

Precise Exception Model
[Figure: instruction sequence insn A, insn B, insn C, insn D, with an interrupt marked at insn B]
● Q: Should instruction A be killed in pipeline flush if instruction B has committed?
● Q: What should EPC point to?

Why are Precise Exceptions Useful? Restartable
● Not all traps terminate a program
  ○ Page faults, syscalls, etc.
● Well-defined architectural state simplifies returning from exception
  ○ Resume execution by jumping back to EPC (or EPC+4)
  ○ No visible side effects from partial execution → no need to save/restore microarchitectural state

Why are Precise Exceptions Useful? Deterministic
● Valuable for reproducibility and debugging
● Easy to identify the exact instruction that faulted
● Program state (registers, coredump, commit trace) matches mental model that programmers have about sequential execution

Why are Precise Exceptions Problematic? Microarchitectural complexity
● Must preserve enough information for hardware to recover architectural state and repair internal state
  ○ Checkpointing rename tables
● In-order commit requirement can limit performance
  ○ Head-of-line blocking in ROB
● Difficult to avoid partial side effects for more complex instructions
  ○ Vector memory operations

Why is Out-of-Order Execution Useful?
● Exploit instruction-level parallelism (ILP) to keep processor busy
  ○ Make suboptimal code run fast
● Dynamically schedule around long-latency instructions
    ld  x2, 0(x1)     # cache miss: 200 cycles
    add x5, x3, x4
    ld  x7, 4(x6)
● Initiate long-latency instructions earlier

What Limits OoO Performance?
    A: fmul f1, f0, f2
    B: fadd f0, f3, f1
    C: fmul f3, f2, f3
    D: fadd f3, f3, f1
● Want to issue instruction C right after A, but cannot reorder it earlier due to WAR hazard on B (f3)
● Suppose only four F registers exist, and it is not feasible for compiler to choose f2 as the destination of C since f2 is read by a later instruction
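The dependencies among instructions A through D can be enumerated mechanically. The following is a minimal sketch, not part of the original slides: the tuple encoding and the `hazards` helper are hypothetical names chosen for illustration, and the pairwise check is deliberately naive (it does not account for intervening writes that would hide an older dependency).

```python
from itertools import combinations

# (label, destination, sources) for the four instructions on the slide.
PROGRAM = [
    ("A", "f1", ("f0", "f2")),
    ("B", "f0", ("f3", "f1")),
    ("C", "f3", ("f2", "f3")),
    ("D", "f3", ("f3", "f1")),
]

def hazards(program):
    """Return (kind, earlier, later, register) for every ordered pair.

    Assumption: a simple pairwise comparison in program order; it ignores
    intervening writes, which is sufficient for this short sequence.
    """
    found = []
    for (e_lbl, e_dst, e_srcs), (l_lbl, l_dst, l_srcs) in combinations(program, 2):
        if e_dst in l_srcs:      # later reads what earlier writes
            found.append(("RAW", e_lbl, l_lbl, e_dst))
        if l_dst in e_srcs:      # later overwrites what earlier reads
            found.append(("WAR", e_lbl, l_lbl, l_dst))
        if e_dst == l_dst:       # both write the same register
            found.append(("WAW", e_lbl, l_lbl, e_dst))
    return found

if __name__ == "__main__":
    for kind, earlier, later, reg in hazards(PROGRAM):
        print(f"{kind}: {earlier} -> {later} on {reg}")
```

Only the RAW entries are true dataflow dependencies; the WAR entry from B to C on f3 is the one that prevents C from being hoisted above B.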

What Limits OoO Performance?
● WAW/WAR hazards
  ○ Caused by reuse of limited set of architectural (named) registers
  ○ Would not exist if an infinite number of registers were available
  ○ Not a “true” data dependency
● How can x86 (8 “GPRs”) and x86-64 (16 GPRs) implementations achieve high performance?
● How can we use more registers than what the ISA specifies?

Register Renaming
● Main idea: Decouple architectural registers (used for expressing dataflow) from physical registers (used for storage)
  ○ For each in-flight instruction, rename the destination register with a unique tag that refers to a separate buffer to hold result
  ○ Somehow maintain relationship between tags and ISA registers
● “All problems in computer science can be solved by another level of indirection” - David Wheeler, inventor of the subroutine call

Register Renaming
    Original               Renamed
    A: fmul f1, f0, f2     fmul P4, P0, P2
    B: fadd f0, f3, f1     fadd P5, P3, P4
    C: fmul f3, f2, f3     fmul P6, P2, P3
    D: fadd f3, f3, f1     fadd P7, P6, P4

    Rename Table    Initial    Final
    f0              P0         P5
    f1              P1         P4
    f2              P2         P2
    f3              P3         P7
● Resembles static single assignment (SSA) form
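The renamed sequence and the final rename table above can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: physical registers are handed out sequentially starting at P4 and are never reclaimed, and the `rename` function and tuple encoding are hypothetical names used only for illustration.

```python
# (opcode, destination, source 1, source 2) for instructions A through D.
PROGRAM = [
    ("fmul", "f1", "f0", "f2"),
    ("fadd", "f0", "f3", "f1"),
    ("fmul", "f3", "f2", "f3"),
    ("fadd", "f3", "f3", "f1"),
]

def rename(program, num_arch_regs=4):
    """Rename destinations to fresh physical registers.

    Assumption: an unbounded supply of physical registers, allocated
    in order (P4, P5, ...), with no free list or reclamation.
    """
    # Initial mapping f_i -> P_i, matching the "Initial" column.
    table = {f"f{i}": f"P{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for op, rd, rs1, rs2 in program:
        # Sources are looked up BEFORE the destination is remapped,
        # so "fmul f3, f2, f3" reads the old mapping of f3.
        ps1, ps2 = table[rs1], table[rs2]
        pd = f"P{next_phys}"
        next_phys += 1
        table[rd] = pd
        renamed.append((op, pd, ps1, ps2))
    return renamed, table

if __name__ == "__main__":
    renamed, final_table = rename(PROGRAM)
    for op, pd, ps1, ps2 in renamed:
        print(f"{op} {pd}, {ps1}, {ps2}")
    print(final_table)
```

After renaming, C writes P6 while B still reads P3, so the WAR hazard on f3 no longer constrains the order of B and C.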

Tomasulo’s Algorithm (Q1)
● On instruction dispatch (in program order):
  1. Allocate reservation station (RS) entry
  2. If source register has “present” (P) bit set in register file (RF) entry, copy value into tag/data field in RS and set P bit for operand
  3. Otherwise, copy tag from RF into RS and clear P bit for operand
  4. Replace RF entry for destination register with tag assigned to RS entry (tag_dest)
● Prior to execution:
  1. For missing operands, monitor result bus for tag match; replace tag with value; set P
  2. When all operands are present, issue to functional unit
● On completion:
  1. Broadcast <tag_dest, result> on result bus for RF and other RS entries to consume
  2. Deallocate RS entry
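The dispatch and completion steps above can be sketched as a small bookkeeping model. This is an illustrative sketch only, not the worksheet's reference solution: the class and method names are hypothetical, tags are assumed to be unique strings rather than RS indices, and the result value is supplied by the caller instead of being computed by a functional unit.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    present: bool   # the P bit
    value: object   # data if present, otherwise the producer's tag

@dataclass
class RSEntry:
    op: str
    src1: Operand
    src2: Operand

class Tomasulo:
    def __init__(self, regs):
        # Each RF entry holds a P bit plus either a value or a tag.
        self.rf = {r: Operand(True, v) for r, v in regs.items()}
        self.rs = {}        # tag -> RSEntry
        self.next_tag = 0   # assumption: tags are never reused

    def dispatch(self, op, rd, rs1, rs2):
        tag = f"T{self.next_tag}"           # step 1: allocate RS entry
        self.next_tag += 1
        # Steps 2-3: copy value (P set) or tag (P clear) from the RF.
        s1 = Operand(self.rf[rs1].present, self.rf[rs1].value)
        s2 = Operand(self.rf[rs2].present, self.rf[rs2].value)
        self.rs[tag] = RSEntry(op, s1, s2)
        # Step 4: destination now names this RS entry's tag.
        self.rf[rd] = Operand(False, tag)
        return tag

    def ready(self, tag):
        entry = self.rs[tag]
        return entry.src1.present and entry.src2.present

    def complete(self, tag, result):
        del self.rs[tag]                    # deallocate RS entry
        # Broadcast <tag, result> to waiting RS operands ...
        for entry in self.rs.values():
            for src in (entry.src1, entry.src2):
                if not src.present and src.value == tag:
                    src.present, src.value = True, result
        # ... and to any RF entry still waiting on this tag.
        for reg, ent in self.rf.items():
            if not ent.present and ent.value == tag:
                self.rf[reg] = Operand(True, result)

if __name__ == "__main__":
    t = Tomasulo({"f0": 1.0, "f1": 0.0, "f2": 2.0, "f3": 3.0})
    a = t.dispatch("fmul", "f1", "f0", "f2")
    b = t.dispatch("fadd", "f0", "f3", "f1")   # waits on a's tag
    print(t.ready(a), t.ready(b))
    t.complete(a, 2.0)
    print(t.ready(b), t.rf["f1"])
```

Note that `dispatch` overwrites the RF entry for the destination immediately, which is the property that makes exceptions imprecise in this scheme.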

Tomasulo’s Algorithm
Q: Why can’t the reservation station entry for an instruction be deallocated immediately on issue?
    A: fmul f4, f0, f1    # Dispatched and issued immediately; RS is freed
    B: fmul f5, f2, f3    # Allocated same RS as A before A has written back
f4 and f5 now assigned the same tag in regfile, causing instruction B to incorrectly clobber f4 on writeback

Tomasulo’s Algorithm
Q: Why are exceptions imprecise in this implementation?
● Register file is irrevocably modified on dispatch
● No mechanism to recover original value of destination register if instruction causes an exception

How to Regain Precise Exceptions?
Reorder Buffer (ROB) separates commit from completion:
[Figure: ROB with "oldest" and "free" pointers; entry fields: v, i, op, p, rs1/tag, p, rs2/tag, p, result, rd, xcpt?]
● Completion: Result available (out-of-order)
● Commit: Architectural state updated (in-order)

Data-in-ROB
Both tags and data held in ROB, with separate architectural register file
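The split between out-of-order completion and in-order commit can be sketched for a data-in-ROB design as follows. This is a simplified model under stated assumptions: the `ROB` and `ROBEntry` names are hypothetical, each entry carries only a destination, a result, a done flag, and an exception flag, and an exception at the head simply discards all younger entries.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    rd: str
    done: bool = False
    result: object = None
    xcpt: bool = False

class ROB:
    def __init__(self, arch_regs):
        self.arch = dict(arch_regs)   # separate architectural register file
        self.entries = deque()        # leftmost entry is the oldest

    def dispatch(self, rd):
        entry = ROBEntry(rd)
        self.entries.append(entry)    # allocated in program order
        return entry

    def complete(self, entry, result=None, xcpt=False):
        # Completion may happen in any order; the result waits in the ROB.
        entry.done, entry.result, entry.xcpt = True, result, xcpt

    def commit(self):
        """Retire finished entries from the head, strictly in order.

        Returns (committed destinations, faulting entry or None).
        """
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.xcpt:
                # Younger instructions never touched self.arch, so
                # discarding them leaves a precise state.
                self.entries.clear()
                return committed, head
            self.arch[head.rd] = head.result
            committed.append(head.rd)
        return committed, None

if __name__ == "__main__":
    rob = ROB({"x1": 0, "x2": 0, "x3": 0})
    a, b, c = rob.dispatch("x1"), rob.dispatch("x2"), rob.dispatch("x3")
    rob.complete(c, 30)               # youngest finishes first
    print(rob.commit(), rob.arch)     # nothing retires: head is not done
    rob.complete(a, 10)
    rob.complete(b, xcpt=True)
    print(rob.commit(), rob.arch)
```

The first `commit` call illustrates head-of-line blocking: the completed result for x3 cannot become architectural until the older entries retire.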

Unified Physical Register File
Physical register file holds both committed and temporary values; only tags held in ROB

Renaming with Unified PRF (Q2)
● On dispatch:
  1. Allocate new physical register for destination from free list
  2. Update decode-stage mapping
● On commit:
  1. Update architectural mapping
  2. Deallocate previous physical register for destination; re-add to free list
● On exception:
  1. Repair decode-stage rename table by un-renaming in reverse order; walk through ROB entries from newest to oldest (MIPS R10k approach)
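The three events above can be sketched with a free list and two mapping tables. This is a hedged illustration rather than the worksheet's solution: the `UnifiedPRF` class is a hypothetical name, physical registers are plain integers, each ROB entry records only (destination, new physical register, previous physical register), and an exception is assumed to squash every in-flight instruction.

```python
class UnifiedPRF:
    def __init__(self, arch_regs, num_phys):
        # Decode-stage (speculative) mapping and committed mapping.
        self.rename = {r: i for i, r in enumerate(arch_regs)}
        self.committed = dict(self.rename)
        self.free = list(range(len(arch_regs), num_phys))
        self.rob = []   # oldest first: (rd, new_phys, prev_phys)

    def dispatch(self, rd):
        new = self.free.pop(0)        # 1. allocate from the free list
        prev = self.rename[rd]
        self.rename[rd] = new         # 2. update decode-stage mapping
        self.rob.append((rd, new, prev))
        return new

    def commit(self):
        rd, new, prev = self.rob.pop(0)   # oldest entry retires
        self.committed[rd] = new          # 1. update architectural mapping
        self.free.append(prev)            # 2. previous register is now dead

    def exception(self):
        # Walk the ROB from newest to oldest, undoing each rename.
        # Assumption: all in-flight instructions are squashed.
        while self.rob:
            rd, new, prev = self.rob.pop()
            self.rename[rd] = prev
            self.free.append(new)

if __name__ == "__main__":
    prf = UnifiedPRF(["f0", "f1", "f2", "f3"], num_phys=8)
    prf.dispatch("f1")
    prf.dispatch("f0")
    prf.commit()          # the write to f1 becomes architectural
    prf.dispatch("f3")
    prf.exception()       # the writes to f0 and f3 are un-renamed
    print(prf.rename, sorted(prf.free))
```

Walking newest to oldest matters when the same architectural register is renamed more than once in flight: undoing in reverse order restores the oldest previous mapping last.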
