Kilo-Instruction Processors
Adrián Cristal
2/7/2019
YALE 80
Processor-DRAM Gap (latency)
[Figure: performance (log scale) vs. year, 1980–2000. µProc performance grows 60%/yr (“Moore’s Law”); DRAM performance grows 7%/yr; the processor-memory performance gap grows 50%/year.]
D.A. Patterson, “New Directions in Computer Architecture”, Berkeley, June 1998
Integer, 8-way, L2 1MB
[Figure: IPC (0.0–5.0) vs. ROB size (128–4096) for perceptron and perfect branch predictors, at memory latencies of 100, 500 and 1000 cycles; reference lines for Perfect Mem. & Perfect BP and Perfect Mem. & Perceptron BP. Annotated ratios: 1.22X, 1.41X, 0.6X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
Floating-point, 8-way, L2 1MB
[Figure: IPC (0.0–6.0) vs. ROB size (128–8192) for perceptron and perfect branch predictors, at memory latencies of 100, 500 and 1000 cycles; reference lines for Perfect Mem. & Perfect BP and Perfect Mem. & Perceptron BP. Annotated ratios: 2.34X, 3.91X, 0.45X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002
Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003
Execution Locality

void smvp(int nodes, double ***A, int *Acol, int *Aindex,
          double **v, double **w)
{
    [...]
    sum0 = A[Anext][0][0]*v[i][0]
         + A[Anext][0][1]*v[i][1]
         + A[Anext][0][2]*v[i][2];
    [...]
}

[Figure: the dynamic instructions of smvp partitioned into Miss-Dependent and Cache-Dependent instruction clusters (one cluster labelled 22.2% of the code).]
Mapping Clusters to Processors
• An execution cluster is a partition of the dynamic DDG whose instructions belong to the same locality group.
• High-locality clusters: a large fraction of the instructions (70% in SpecFP, even more in SpecINT)
  – need OoO to tolerate L2 cache hit latencies
  – advance as fast as possible (prefetching effect!)
  – thus the Cache Processor can be small, but must be Out-of-Order
• Low-locality clusters: a small fraction of the instructions (<30%)
  – generally not on the critical path (Karkhanis, WMPI'02)
  – thus the Memory Processor can be even smaller, and probably In-Order
A different view: D-KIP
Distribution of instructions based on decode→issue latency (SPEC FP 2000, measured in groups of 30 cycles): about 70% high locality (58% in the first group), about 30% low locality (20%, 15%, 6%).

Cache Processor (latency critical):
• processes only cache-hit/register-dependent instructions
• buffers few instructions (<100)
• speculative / out-of-order
• executes most of the control code
• LD/ST intensive

Memory Processor (latency tolerant):
• processes miss-dependent instructions
• buffers thousands of instructions
• relaxed scheduling
• little control code → few recoveries
• few address calculations
• memory lookahead (prefetching)
• no caches, no fetch/decode logic

[Figure: DECOUPLED KILO-INSTRUCTION PROCESSOR model — FETCH feeds the Cache Processor (2MB L2 cache); miss-dependent instructions pass to the Memory Processor, which accesses main memory (400-cycle access latency). (HPCA'06, PACT'07)]
Miquel Pericas et al., “A Decoupled KILO-Instruction Processor”, HPCA'06
Flexible Heterogeneous MultiCore (I)
[Figure: the instruction window (ROB), oldest to youngest, split between a Cache Processor and a set of Memory Engines.]
• Cache Processor: a small out-of-order core, designed assuming a perfect L2; it processes all cache-dependent code.
• Memory Engines: small dual-issue in-order processors executing code depending on L2 cache misses; each memory engine processes a portion of the instruction window. Inclusion of all loads/stores provides sequential memory semantics.
• Activation and de-activation of memory engines changes the window size and allows control of power consumption.
Miquel Pericas et al., “A Flexible Heterogeneous Multi-Core Architecture”, PACT'07
Flexible Heterogeneous MultiCore (II)
• Extension to multithreading: a dynamically assigned pool of MEs
[Figure: several Cache Processors, each with its own ROB, sharing a pool of Memory Engines.]
• The pool of Memory Engines can be shared for higher throughput/fairness
Kilo, Runahead and Prefetching
• Prefetching
  – anticipates memory requests
  – reduces the impact of misses in the memory hierarchy
• Runahead mechanism
  – executes speculative instructions under an LLC miss
  – prevents the processor from stalling when the ROB is full
  – generates useful data prefetches
• Kilo-instruction processors
  – exploit more ILP by maintaining thousands of in-flight instructions while long-latency loads are outstanding in memory (implicit prefetching)
Tanausú Ramírez et al., “Kilo-instruction Processors, Runahead and Prefetching”, CF'06, May 2006
Performance versus RunAhead and Stride Prefetching
• OoO and RunAhead are 4-way with 64/256-entry ROBs
• Cache Processors are 4-way with a 64-entry ROB
• Memory Processors/Memory Engines are 2-way in-order processors
• A Memory Engine can hold up to 128 long-latency instructions and 128 loads/stores
• RunAhead features an ideal runahead cache
• The stream prefetcher can hold up to 64KB of prefetched data
“Kilo-processor” and multiprocessor systems
[Figure: IPC (0–3.5) vs. ROB size (64–2048) for FFT, RADIX and LU under three configurations: BADA, IDEAL NET, and IDEAL NET & MEM.]
M. Galluzzi et al., “A First Glance at Kiloinstruction Based Multiprocessors”, invited paper, ACM Computing Frontiers Conference, Ischia, Italy, April 10-12, 2004
What we wanted to do
• Can we extend a Big-Little multicore to implement the FMC?
• Are the Memory Engines (MEs) used all the time, or are they waiting for long-latency loads?
• Can we do something to avoid discarding all the MEs in case of branch mispredictions?
• What does a practical kilo-vector processor look like?
Some ideas “stolen” from EDGE processors and decoupled architectures
[Figure: a pool of Cache Processors connected through waiting queues to a pool of MEs.]
• Split the functionality of the MEs, the instruction queues and the functional units
Waiting Queue
• Instructions + logical registers
• Wait until all used logical registers are ready
• Then assign a Memory Engine
[Figure: a waiting queue of instructions plus a waiting input register file feeding a Memory Engine.]
Where to start a waiting queue
• Code skeleton:
  Loop:
    …
    If: Br …
      …
    Else:
      …
    Fi:
      …
  Endloop
[Animation, three steps: the dynamic trace of loop iterations is shown alongside the code. A new waiting queue starts at the hard-to-predict branch (If / Else) and where the registers the queue needs are first written (“New R1, R2”, “New R1”, “R1<-”).]
Problems and more problems
• What to do if the addresses of loads and stores are modified?
• Fetching instructions, and partial re-execution
• Pointer chasing
• Start a new waiting queue, or suspend the execution of a waiting queue?
“Kilo-vector” processor
[Figure: a program with 20 units of scalar work and 80 units of vectorizable work (100 total). Vectorization shrinks the vector part to 8 units (20 + 8 = 28, speedup ≈ 3.5); a kilo-vector processor also overlaps the scalar part, shrinking it to 5 units (5 + 8 = 13, speedup ≈ 7.7).]
F. Quintana et al., “Kilo-vector” processors, UPC-DAC
Adrian.cristal@bsc.es