Implementing out-of-order execution processors IBM 360/91 High - PowerPoint PPT Presentation

Implementing out-of-order execution processors IBM 360/91 High performance substrate CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 1

Historical perspective Pipeline RISC Superscalar Out-of-order VLIW SMT 1960 1970 1980 1990 2000 1961: IBM Stretch 1964: CDC 6600 1974: Data flow 1980: Berkeley RISC 1992: IBM PowerPC 600 1962: ILLIAC II 1967: IBM 360/91 1976: Cray 1 1981: Stanford MIPS 1995: Intel Pentium Pro 1977: DEC VAX 1983: Yale VLIW 1996: MIPS R10000 1985: Berkeley HPS 1998: DEC Alpha 21264 CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 2 2

Historical Context : CDC6600 • Mainframe computer in 1964 • Superscalar design with 10 parallel functional units • Functional units not pipelined • Instructions fetched and issued faster than execution CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 3 3

Scoreboarding • Scoreboard • A central control to determine dependencies and prevent hazards • Steps: • Issue • Prevents WAW and Structural hazards • Read Operands • Leads to OOO • Execution • Followed by notification to scoreboard • Write result • Checks for WAR CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 4 4

Architecture EX ID FU1 Read EX Write IF Issue operands FU2 results EX FUn Structural hazard: RAW data hazard: wait WAR data hazard: delaying the issue until the values of the delaying the write if a WAW data hazard: source registers are WAR hazard exists delaying the issue available in the registers CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 5 5

Parts of Scoreboard • Instructional status • Indicates which of the 4 steps an instruction is in • Functional unit status • State of functional unit • 9 such states. Eg. busy state • Register result status • Indicates the functional unit that will write each register CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 6 6

Scoreboard Structure CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 7 7

Scoreboarding Limitations • Number of entries in scoreboard • Determines look ahead for independent instructions • Number and types of functional units • Affect structural dependences • Centralized control • Only 1 instruction can be issued at a time • Low throughput • Stalls for WAW and WAR CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 8 8

Historical Context: IBM 360/91 • Tomasulo's Algorithm implemented for the Floating Point operations • It had only 2 functional units: 1 adder and 1 multiplier/ divider • Had only 4 double precision FP registers CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 9 9

Tomasulo's Goals • The design must identify existence of a dependency • It must sequence the instructions correctly • It must allow independent instructions to overlap CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 10 10

Examples RAW Hazard LD F0 FLB1 MD F0 FLB2 • It is a true dependency • Second operation must not proceed until the first one is complete. • F0 cannot be used until the recent operations using it as sink are complete Independent Instruction LD F0, FLB1 MD F2, FLB2 CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 11 11

Tomasulo's Algorithm w.o CDB • Maintaining precedence using control bits on registers (busy bit scheme) for true dependencies • Set control bit when register is a sink • Transmit data to waiting unit when register gets result • Achieving parallelism through use of different registers is programmer's responsibility for WAW and WAR • Meets the dependency goals but not performance goal • There is a stall for data dependences • Programmer resolves false dependences in code CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 12 12

CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 13 13

Reservation Stations • To efficiently utilize execution units during stalls for true dependences • Example: LD F0, D F0=D LD F2, C F2=C LD F4, B F4=B MD F0, E F0 = D * E AD F2, F0 F2 = C + D * E AD F4, A F4 = A + B AD F2, F4 F2 = A + B + C + D * E CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 14 14

Removing False Dependences Common Data Bus • Efficiently moves data to allow concurrency • Every unit that alters a register feeds into CDB • Every unit that requires a register is fed by CDB • These units are recognized by identifier called tag Register Renaming • Tagging is the mechanism • Removes false dependences • WAW is resolved since register keeps track of last operation tag that updated it • WAR is resolved using in-order decode CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 15 15

CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 16 16

Tomasulo’s Algorithm example Example source: “Modern processor design” textbook by John Paul Shen CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 17 17

Details on register renaming • Output dependence (WAW hazard) – Scoreboard: Instruction issue is stalled – Tomasulo: Resolved by changing the pointer to the reservation for pending update • Anti dependence (WAR hazard) – Scoreboard: Write back is stalled – Tomasulo: Resolved by early dispatch with register values CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 18 18

Steps in Tomasulo's Algorithm • Issue • Instruction issued in-order • Issue to the reservation station with the operands or track the FUs that will produce operands • Stall if no reservation station is available • Execute • Instructions are executed when all operands become available • Many instructions executed simultaneously • Write results • Results are written to CDB • CDB writes to registers and reservation stations CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 19 19

Limitations of Tomasulo's Algorithm • The number of CDBs limits bandwidth • Increasing CDBs increases complexity and cost • Hard to debug because of imprecise interrupts Dynamic scheduling with in-order commit HPS CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 20 20

Scoreboard vs. Tomasulo Scoreboard Tomasulo Issue When FU free When RS free Read operands From reg file From reg file, CDB Write operands To reg file To CDB Structural hazards Functional units Reservation stations WAW, WAR hazards Problem No problem Register renaming No Yes Instructions No limit 1 per cycle (per completing CDB) Instructions 1 (per set of read No limit beginning exec ports) Slide source: “Instruction Level Parallelism - Tomasulo” lecture notes by Dean Tullsen CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 21 21

Summary Principles: • In-order execution for RAW hazards • Renaming registers for WAR and WAW Components: • Reservation Stations • Buffer operands for instructions waiting to execute • Virtual registers implementing register renaming • Common Data Bus • Hardware implementation for concurrency with multiple FUs • Use tags for broadcasting data • Allow more than one instruction to reach execution stage simultaneously CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 22 22

Historical perspective revisited Pipeline RISC Superscalar Out-of-order VLIW SMT 1960 1970 1980 1990 2000 1961: IBM Stretch 1964: CDC 6600 1974: Data flow 1980: Berkeley RISC 1992: IBM PowerPC 600 1962: ILLIAC II 1967: IBM 360/91 1976: Cray 1 1981: Stanford MIPS 1995: Intel Pentium Pro 1977: DEC VAX 1983: Yale VLIW 1996: MIPS R10000 1985: Berkeley HPS 1998: DEC Alpha 21264 CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 23 23

HPS as restricted data flow completed! CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 24 24

Requirements for high performance • High degree of HW concurrency available • Well utilized HW concurrency CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 25 25

Instruction set architecture of HPS • Fixed 32 bit instruction length RISC - like • Two operations per instruction VLIW - like – Can be dependent or independent of each other • 16 architectural registers – 4 special registers CISC - like – 4 safe registers – 8 unsafe registers CSE240A: Neha Chachra and Bryan S. Kim CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 26 26

Implementing out-of-order execution processors IBM 360/91 High - PowerPoint PPT Presentation

Implementing out-of-order execution processors IBM 360/91 High performance substrate CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 1 Historical perspective Pipeline RISC Superscalar Out-of-order VLIW SMT 1960 1970 1980 1990

Exploiting Out-of-Order-Execution Processor Side Channels to Enable Cross VM Code Execution Sophia

Reorder Buffer Method Issue Execute Write Classic 5-stage pipeline In-order In-order

Order Is A Lie Are you sure you know how your code runs ? Order in code is not respected by

MASTERING STRATEGY EXECUTION 18 BEST PRACTICES FOR STRATEGY EXECUTION STRATEGY EXECUTION AS

Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with

Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

Lect. 4: Shared Memory Multiprocessors Obtained by connecting full processors together

execution states with swapping Processes, Execution, and State 3F. Execution State Model exit

Commission: Out of touch, out of date, out of pocket April 2017 Commission: Out of touch, out of

21264 vs NetBurst Two Different Processors- Both Nonexistent CSE 240C - Rushi Chakrabarti - WI09

Implementing Perl 6 Jonathan Worthington Dutch Perl Workshop 2008 Implementing Perl 6 I

61A Extra Lecture 6 Implementing an Object System 3 Implementing an Object System Today's

Synchronization 2: Locks (part 2), Mutexes 1 load/store reordering recall: out-of-order

CS 105 Intel x86 (IA32/64) Processors Intel x86 (IA32/64) Processors Tour of the Black Holes

UNITED STATES PATENT AND TRADEMARK OFFICE _______________ BEFORE THE PATENT TRIAL AND APPEAL

Introduction Why and what for 2253 Levels of abstraction Instruction Set Architectures

Strategic Planning and the Long-Term R&D Plan Jill N. Cooley Department of Safeguards IAEA

CS5460: Operating Systems Lecture: Virtualization Anton Burtsev March, 2013 Traditional

Mental Health A driving application for 4 th Generation Computing GREGORY D. ABOWD GEORGIA TECH

CS307&CS356: Operating Systems Dept. of Computer Science & Engineering Chentao Wu

CS 3410 Computer Science Cornell University [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

CENG4480 Lecture 01: Introduction Bei Yu byu@cse.cuhk.edu.hk (Latest update: September 11,

Implementing out-of-order execution processors IBM 360/91 High - PowerPoint PPT Presentation

Implementing out-of-order execution processors IBM 360/91 High performance substrate CSE240A: Neha Chachra and Bryan S. Kim Feb. 11, 2010 1 Historical perspective Pipeline RISC Superscalar Out-of-order VLIW SMT 1960 1970 1980 1990

Exploiting Out-of-Order-Execution Processor Side Channels to Enable Cross VM Code Execution Sophia

Reorder Buffer Method Issue Execute Write Classic 5-stage pipeline In-order In-order

Order Is A Lie Are you sure you know how your code runs ? Order in code is not respected by

MASTERING STRATEGY EXECUTION 18 BEST PRACTICES FOR STRATEGY EXECUTION STRATEGY EXECUTION AS

Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with

Precise Exceptions and Out-of-Order Execution Samira Khan Multi-Cycle Execution Not all

Out- -of of- -Order Order Out Tomasulos Algorithm Superscalar CPU Superscalar CPU -

Out- -of of- -Order Order Out Superscalar CPU Superscalar CPU Cliff Frey and Vicky Liu May

Lect. 4: Shared Memory Multiprocessors Obtained by connecting full processors together

execution states with swapping Processes, Execution, and State 3F. Execution State Model exit

Commission: Out of touch, out of date, out of pocket April 2017 Commission: Out of touch, out of

21264 vs NetBurst Two Different Processors- Both Nonexistent CSE 240C - Rushi Chakrabarti - WI09

Implementing Perl 6 Jonathan Worthington Dutch Perl Workshop 2008 Implementing Perl 6 I

61A Extra Lecture 6 Implementing an Object System 3 Implementing an Object System Today's

Synchronization 2: Locks (part 2), Mutexes 1 load/store reordering recall: out-of-order

CS 105 Intel x86 (IA32/64) Processors Intel x86 (IA32/64) Processors Tour of the Black Holes

UNITED STATES PATENT AND TRADEMARK OFFICE _______________ BEFORE THE PATENT TRIAL AND APPEAL

Introduction Why and what for 2253 Levels of abstraction Instruction Set Architectures

Strategic Planning and the Long-Term R&amp;D Plan Jill N. Cooley Department of Safeguards IAEA

CS5460: Operating Systems Lecture: Virtualization Anton Burtsev March, 2013 Traditional

Mental Health A driving application for 4 th Generation Computing GREGORY D. ABOWD GEORGIA TECH

CS307&amp;CS356: Operating Systems Dept. of Computer Science &amp; Engineering Chentao Wu

CS 3410 Computer Science Cornell University [K. Bala, A. Bracy, E. Sirer, and H. Weatherspoon]

CENG4480 Lecture 01: Introduction Bei Yu byu@cse.cuhk.edu.hk (Latest update: September 11,

Strategic Planning and the Long-Term R&D Plan Jill N. Cooley Department of Safeguards IAEA

CS307&CS356: Operating Systems Dept. of Computer Science & Engineering Chentao Wu