Kilo-Instruction Processors
Adrián Cristal
2/7/2019
YALE 80
Processor-DRAM Gap (latency)
[Figure: performance (log scale) vs. year, 1980–2000. µProc performance grows 60%/yr (“Moore’s Law”); DRAM performance grows 7%/yr; the processor-memory performance gap grows 50%/year.]
D.A. Patterson, “New Directions in Computer Architecture”, Berkeley, June 1998
Integer, 8-way, L2 1MB
[Figure: IPC (0.0–5.0) vs. ROB size (128–4096) for perceptron and perfect branch predictors, at memory latencies of 100, 500 and 1000 cycles; reference lines for Perfect Mem. & Perfect BP and Perfect Mem. & Perceptron BP. Annotated ratios: 1.22X, 1.41X, 0.6X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
Floating-point, 8-way, L2 1MB
[Figure: IPC (0.0–6.0) vs. ROB size (128–8192) for perceptron and perfect branch predictors, at memory latencies of 100, 500 and 1000 cycles; reference lines for Perfect Mem. & Perfect BP and Perfect Mem. & Perceptron BP. Annotated ratios: 2.34X, 3.91X, 0.45X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
Execution Locality

void smvp(int nodes, double ***A, int *Acol, int *Aindex,
          double **v, double **w)
{
    [...]
    sum0 = A[Anext][0][0]*v[i][0]
         + A[Anext][0][1]*v[i][1]
         + A[Anext][0][2]*v[i][2];
    [...]
}

[Figure: the dynamic instructions of smvp partitioned into Miss-Dependent and Cache-Dependent instruction clusters (one cluster labelled 22.2% of the code).]
Mapping Clusters to Processors
• An execution cluster is a partition of the dynamic DDG whose instructions belong to the same locality group.
• High-locality clusters: a large fraction of the instructions (70% in SpecFP, even more in SpecINT)
  – need OoO to tolerate L2 cache hit latencies
  – advance as fast as possible (prefetching effect!)
  – thus the Cache Processor can be small, but must be Out-of-Order
• Low-locality clusters: a small fraction of the instructions (<30%)
  – generally not on the critical path (Karkhanis, WMPI'02)
  – thus the Memory Processor can be even smaller, and probably In-Order
A different view: D-KIP
Distribution of instructions based on decode→issue latency (SPEC FP 2000, measured in groups of 30 cycles): about 70% high locality (58% in the first group), about 30% low locality (20%, 15%, 6%).

Cache Processor (latency critical):
• processes only cache-hit/register-dependent instructions
• buffers few instructions (<100)
• speculative / out-of-order
• executes most of the control code
• LD/ST intensive

Memory Processor (latency tolerant):
• processes miss-dependent instructions
• buffers thousands of instructions
• relaxed scheduling
• little control code → few recoveries
• few address calculations
• memory lookahead (prefetching)
• no caches, no fetch/decode logic

[Figure: DECOUPLED KILO-INSTRUCTION PROCESSOR model — FETCH feeds the Cache Processor (2MB L2 cache); miss-dependent instructions pass to the Memory Processor, which accesses main memory (400-cycle access latency). (HPCA'06, PACT'07)]
Miquel Pericas et al., “A Decoupled KILO-Instruction Processor”, HPCA'06
Flexible Heterogeneous MultiCore (I)
[Figure: the instruction window (ROB), oldest to youngest, split between a Cache Processor and a set of Memory Engines.]
• Cache Processor: a small out-of-order core, designed assuming a perfect L2; it processes all cache-dependent code.
• Memory Engines: small dual-issue in-order processors executing code depending on L2 cache misses; each memory engine processes a portion of the instruction window. Inclusion of all loads/stores provides sequential memory semantics.
• Activation and de-activation of memory engines changes the window size and allows control of power consumption.
Miquel Pericas et al., “A Flexible Heterogeneous Multi-Core Architecture”, PACT'07
Flexible Heterogeneous MultiCore (II)
• Extension to multithreading: a dynamically assigned pool of MEs
[Figure: several Cache Processors, each with its own ROB, sharing a pool of Memory Engines.]
• The pool of Memory Engines can be shared for higher throughput/fairness
Kilo, Runahead and Prefetching
• Prefetching
  – anticipates memory requests
  – reduces the impact of misses in the memory hierarchy
• Runahead mechanism
  – executes speculative instructions under an LLC miss
  – prevents the processor from stalling when the ROB is full
  – generates useful data prefetches
• Kilo-instruction processors
  – exploit more ILP by maintaining thousands of in-flight instructions while long-latency loads are outstanding in memory (implicit prefetching)
Tanausú Ramírez et al., “Kilo-instruction Processors, Runahead and Prefetching”, CF'06, May 2006
Performance versus RunAhead and Stride Prefetching
• OoO and RunAhead are 4-way with 64/256-entry ROBs
• Cache Processors are 4-way with a 64-entry ROB
• Memory Processors/Memory Engines are 2-way in-order processors
• A Memory Engine can hold up to 128 long-latency instructions and 128 loads/stores
• RunAhead features an ideal runahead cache
• The stream prefetcher can hold up to 64KB of prefetched data
“Kilo-processor” and multiprocessor systems
[Figure: IPC (0–3.5) vs. ROB size (64–2048) for FFT, RADIX and LU under three configurations: BADA, IDEAL NET, and IDEAL NET & MEM.]
M. Galluzzi et al., “A First Glance at Kiloinstruction Based Multiprocessors”, invited paper, ACM Computing Frontiers Conference, Ischia, Italy, April 10-12, 2004
What we wanted to do
• Can we extend a Big-Little multicore to implement the FMC?
• Are the Memory Engines (MEs) used all the time, or are they waiting for long-latency loads?
• Can we do something to avoid discarding all the MEs in case of branch mispredictions?
• What does a practical kilo-vector processor look like?
Some ideas “stolen” from EDGE processors and decoupled architectures
[Figure: a pool of Cache Processors connected through waiting queues to a pool of MEs.]
• Split the functionality of the MEs, the instruction queues and the functional units
Waiting Queue
• Instructions + logical registers
• Wait until all used logical registers are ready
• Then assign a Memory Engine
[Figure: a waiting queue of instructions plus a waiting input register file feeding a Memory Engine.]
Where to start a waiting queue
• Code skeleton:
  Loop:
    …
    If: Br …
      …
    Else:
      …
    Fi:
      …
  Endloop
[Animation, three steps: the dynamic trace of loop iterations is shown alongside the code. A new waiting queue starts at the hard-to-predict branch (If / Else) and where the registers the queue needs are first written (“New R1, R2”, “New R1”, “R1<-”).]
Problems and more problems
• What to do if the addresses of loads and stores are modified?
• Fetching instructions, and partial re-execution
• Pointer chasing
• Start a new waiting queue, or suspend the execution of a waiting queue?
“Kilo-vector” processor
[Figure: a program with 20 units of scalar work and 80 units of vectorizable work (100 total). Vectorization shrinks the vector part to 8 units (20 + 8 = 28, speedup ≈ 3.5); a kilo-vector processor also overlaps the scalar part, shrinking it to 5 units (5 + 8 = 13, speedup ≈ 7.7).]
F. Quintana et al., “Kilo-vector” processors, UPC-DAC
Adrian.cristal@bsc.es