Superscalar Organization
Instructor: Nima Honarmand
Spring 2015 :: CSE 502 – Computer Architecture

Instruction-Level Parallelism (ILP)
• Recall: “Parallelism is the number of independent tasks available”
• ILP is a measure of inter-dependencies between insns.
• Average ILP = num. instructions / num. cycles required
  – code1: ILP = 1, i.e. must execute serially
  – code2: ILP = 3, i.e. can execute at the same time

code1:              code2:
r1 ← r2 + 1         r1 ← r2 + 1
r3 ← r1 / 17        r3 ← r9 / 17
r4 ← r0 - r3        r4 ← r0 - r10
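
The average-ILP ratio above can be checked with a small sketch. It assumes the idealized machine the slides describe (unlimited resources, unit latency) and tracks only true read-after-write dependences. The `(dest, sources)` tuple layout and the name `average_ilp` are illustrative, not from the slides, and code1 is taken to be the dependent chain.

```python
# Sketch: average ILP = instructions / cycles on an idealized machine
# (unlimited resources, every instruction takes one cycle).
# Only true (read-after-write) dependences are tracked.

def average_ilp(insns):
    ready = {}       # register -> cycle in which its latest value is produced
    last_cycle = 0
    for dest, srcs in insns:
        # An instruction executes one cycle after its latest producer.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(insns) / last_cycle

# code1: each instruction reads the result of the previous one.
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: no instruction reads a register written by another.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1))  # 1.0
print(average_ilp(code2))  # 3.0
```

The denominator is the length of the longest dependence chain, which is why code1 needs three cycles and code2 needs one.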

ILP != IPC
• ILP usually assumes
  – Infinite resources
  – Perfect fetch
  – Unit-latency for all instructions
• ILP is a property of the program dataflow
• IPC is the “real” observed metric
  – How many insns. are executed per cycle
• ILP is an upper-bound on the attainable IPC
  – Specific to a particular program
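
To make the gap between ILP and IPC concrete, here is a rough sketch that drops two of the idealizing assumptions: it limits how many instructions can issue per cycle and lets latencies differ. The function name `observed_ipc`, the `(op, dest, srcs)` layout, and the latency values are all hypothetical choices for illustration.

```python
from collections import defaultdict

def observed_ipc(insns, width, latency):
    """insns: list of (op, dest, srcs); latency: dict mapping op -> cycles."""
    done = {}                   # register -> cycle its value becomes available
    issued = defaultdict(int)   # cycle -> number of instructions issued in it
    finish = 0
    for op, dest, srcs in insns:
        # Earliest cycle in which all source operands are available.
        cycle = max((done.get(s, 0) for s in srcs), default=0)
        while issued[cycle] >= width:   # no free issue slot: wait a cycle
            cycle += 1
        issued[cycle] += 1
        done[dest] = cycle + latency[op]
        finish = max(finish, done[dest])
    return len(insns) / finish

# Three independent instructions, so the program's ILP is 3.
code2 = [("add", "r1", ["r2"]), ("div", "r3", ["r9"]), ("sub", "r4", ["r0", "r10"])]
unit = {"add": 1, "div": 1, "sub": 1}
slow_div = {"add": 1, "div": 4, "sub": 1}   # assumed 4-cycle divide

print(observed_ipc(code2, 3, unit))      # 3.0  (matches the ILP)
print(observed_ipc(code2, 1, unit))      # 1.0  (one issue slot per cycle)
print(observed_ipc(code2, 3, slow_div))  # 0.75 (long-latency divide)
```

In each constrained configuration the observed IPC falls below the program's ILP of 3, consistent with ILP acting as an upper bound.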

Purported Limits on ILP

Study                         ILP
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7
Kuck et al. [1972]            8
Riseman and Foster [1972]     51
Nicolau and Fisher [1984]     90

ILP Limits of Scalar Pipelines (1)
• Scalar upper bound on throughput
  – Limited to CPI >= 1
  – Solution: superscalar pipelines with multiple insns at each stage
[Figure: Pentium pipeline. Prefetch and Decode1 feed two parallel pipes, the U-pipe and the V-pipe, each with its own Decode2, Execute, and Writeback stage.]

ILP Limits of Scalar Pipelines (2)
• Inefficient unified pipeline
  – Lower resource utilization and longer instruction latency
  – Solution: diversified pipelines
[Figure: diversified pipeline. IF, ID, and RD stages feed an EX stage split into ALU, MEM1-MEM2, FP1-FP2-FP3, and BR pipes, followed by WB.]

ILP Limits of Scalar Pipelines (3)
• Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution 1: out-of-order execution
[Figure: IF, ID, and RD stages (in order); Dispatch Buffer (in order in, out of order out); EX pipes ALU, MEM1-MEM2, FP1-FP2-FP3, and BR; Reorder Buffer (out of order in, in order out); WB.]

ILP Limits of Scalar Pipelines (3)
• Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution 1: out-of-order execution
  – Solution 2: inter-stage buffers
[Figure: Fetch, Instruction Buffer, Decode, Dispatch Buffer, Dispatch (in program order); Issuing Buffer, Execute, Completion Buffer (out of order); Complete, Store Buffer, Retire (in program order).]
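
The effect of the rigid stall policy can be sketched with a toy issue model. This is an illustration under stated assumptions, not a description of any real pipeline: only true dependences are modeled (WAR and WAW hazards are assumed to be removed by renaming), results are usable `latency` cycles after issue, and the name `issue_cycles` and the three-instruction program are hypothetical.

```python
def issue_cycles(insns, out_of_order, width=1):
    """Return the cycle in which each instruction issues.
    insns: list of (dest, srcs, latency)."""
    # For each instruction, find the older instructions producing its sources.
    last_writer, producers = {}, []
    for i, (dest, srcs, _lat) in enumerate(insns):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    issue = [None] * len(insns)
    cycle = 0
    while None in issue:
        slots = width
        for i in range(len(insns)):       # scan oldest first
            if issue[i] is not None:
                continue
            ready = all(
                issue[p] is not None and issue[p] + insns[p][2] <= cycle
                for p in producers[i]
            )
            if ready and slots > 0:
                issue[i] = cycle
                slots -= 1
            elif not out_of_order:
                break   # in order: a stalled insn blocks all newer insns
        cycle += 1
    return issue

prog = [
    ("r1", ["r2"], 3),        # load r1 <- [r2], assumed 3-cycle latency
    ("r3", ["r1"], 1),        # add  r3 <- r1 + 1, waits for the load
    ("r4", ["r5", "r6"], 1),  # sub  r4 <- r5 - r6, independent
]
print(issue_cycles(prog, out_of_order=False))  # [0, 3, 4]
print(issue_cycles(prog, out_of_order=True))   # [0, 3, 1]
```

With in-order issue the independent `sub` waits behind the stalled `add`; with out-of-order issue it slips ahead into cycle 1.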

ILP Limits of Scalar Pipelines (4)
• Instruction dependencies limit parallelism
  – Frequent stalls due to data and control dependencies
  – Solution 1: renaming (for WAR and WAW register dependences)
  – Solution 2: speculation (for control dependences and memory dependences)
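
A minimal sketch of how renaming removes WAR and WAW register dependences: every write is given a fresh physical register, so only true (RAW) dependences remain. The unbounded pool of `p` registers is a simplifying assumption (real hardware recycles a finite free list), and `rename` and the sample program are hypothetical.

```python
from itertools import count

def rename(insns):
    """insns: list of (dest, srcs) over architectural registers.
    Returns the same instructions over physical registers."""
    fresh = count()   # assumed unbounded supply of physical registers
    mapping = {}      # architectural register -> current physical register

    def lookup(reg):
        if reg not in mapping:              # value defined before this snippet
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for dest, srcs in insns:
        new_srcs = [lookup(s) for s in srcs]   # read sources before remapping dest
        mapping[dest] = f"p{next(fresh)}"      # each write gets a fresh register
        renamed.append((mapping[dest], new_srcs))
    return renamed

prog = [
    ("r1", ["r2", "r3"]),  # r1 <- r2 + r3
    ("r2", ["r4"]),        # r2 <- r4 + 1   WAR on r2 with the first insn
    ("r1", ["r5"]),        # r1 <- r5 * 2   WAW on r1 with the first insn
    ("r6", ["r1"]),        # r6 <- r1       RAW on the second write of r1
]
for insn in rename(prog):
    print(insn)
# ('p2', ['p0', 'p1'])
# ('p4', ['p3'])
# ('p6', ['p5'])
# ('p7', ['p6'])
```

After renaming, no destination register is written twice and no destination matches an earlier source, so the second and third instructions are free to execute alongside the first. The RAW dependence of the last instruction on `p6` is preserved.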

ILP Limits of Scalar Pipelines (Summary)
1. Scalar upper bound on throughput
  – Limited to CPI >= 1
  – Solution: superscalar pipelines with multiple insns at each stage
2. Inefficient unified pipeline
  – Lower resource utilization and longer instruction latency
  – Solution: diversified pipelines
3. Rigid pipeline stall policy
  – A stalled instruction stalls all newer instructions
  – Solution: out-of-order execution and inter-stage buffers
4. Instruction dependencies limit parallelism
  – Frequent stalls due to data and control dependencies
  – Solutions: renaming and speculation
State of the art: Out-of-Order Superscalar Pipelines

Overall Picture
• Fetch issues:
  – Fetch multiple insns
  – Branches
  – Branch target mis-alignment
• Decode issues:
  – Identify insns
  – Find dependences
• Execution issues:
  – Dispatch insns
  – Resolve dependences
  – Bypass networks
  – Multiple outstanding memory accesses
• Completion issues:
  – Out-of-order completion
  – Speculative instructions
  – Precise exceptions
[Figure: out-of-order superscalar pipeline. FETCH (I-cache, Branch Predictor, Instruction Buffer) handles instruction flow; DECODE; EXECUTE (Integer, Floating-point, Media, and Memory units) handles register data flow and memory data flow; COMMIT (Reorder Buffer (ROB), Store Queue, D-cache).]
State of the art: Out-of-Order Superscalar Pipelines
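
One way to picture how out-of-order completion can coexist with precise exceptions is a reorder buffer that retires strictly from its head. The sketch below is a simplified illustration: the class name, method names, and instruction tags are hypothetical, and speculation recovery and exception handling are left out.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest entry at the left

    def dispatch(self, tag):
        """Allocate an entry in program order."""
        self.entries.append({"tag": tag, "done": False})

    def complete(self, tag):
        """Mark an instruction finished; this may happen in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, stopping at the
        first unfinished one so architectural state changes in order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["tag"])
        return retired

rob = ReorderBuffer()
for tag in ["i0", "i1", "i2"]:
    rob.dispatch(tag)

rob.complete("i2")
first = rob.commit()    # []            i0 is still outstanding
rob.complete("i0")
second = rob.commit()   # ['i0']
rob.complete("i1")
third = rob.commit()    # ['i1', 'i2']
print(first, second, third)
```

Although `i2` finishes first, nothing younger than an unfinished instruction retires, so at any point the committed state corresponds to a prefix of the program.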
