Kilo Instruction Processors, Adrián Cristal, 2/7/2019, YALE 80 - PowerPoint PPT Presentation

  1. Kilo Instruction Processors Adrián Cristal 2/7/2019 YALE 80

  2. Processor-DRAM Gap (latency)
     [Chart: performance (log scale) vs. time, 1980-2000. µProc (CPU) performance grows 60%/yr ("Moore's Law"); DRAM improves 7%/yr; the processor-memory performance gap grows ~50% per year.]
     D.A. Patterson, “New Directions in Computer Architecture”, Berkeley, June 1998.

  3. Integer, 8-way, L2 1MB
     [Chart: IPC (0.0-5.0) vs. ROB size (128, 512, 1024, 4096) under a perceptron and a perfect branch predictor, for memory latencies of 100, 500, and 1000 cycles. Reference lines: "Perfect Mem. & Perfect BP" (~5.0) and "Perfect Mem. & Perceptron BP" (~4.5). Annotated factors: 1.22X, 0.6X, 1.41X.]
     Research proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002.
     Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002.
     M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003.

  4. Floating-point, 8-way, L2 1MB
     [Chart: IPC (0.0-6.0) vs. ROB size (128, 512, 1024, 4096, 8192) under a perceptron and a perfect branch predictor, for memory latencies of 100, 500, and 1000 cycles. Reference lines: "Perfect Mem. & Perfect BP" (~6.0) and "Perfect Mem. & Perceptron BP" (~5.0). Annotated factors: 2.34X, 3.91X, 0.45X.]
     Research proposal to Intel (July 2001) and presentation to Intel-MRL, March 2002.
     Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002.
     M. Valero, NSF Workshop on Computer Architecture, ISCA Conference, San Diego, June 2003.

  5. Execution Locality
     void smvp(int nodes, double ***A, int *Acol, int *Aindex,
               double **v, double **w) {
       [...]
       sum0 = A[Anext][0][0]*v[i][0] + A[Anext][0][1]*v[i][1]
            + A[Anext][0][2]*v[i][2];
       [...]
     }
     [Figure: the kernel's dynamic instructions partitioned into miss-dependent instruction clusters and cache-dependent code; 22.2% is annotated on the figure.]
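
     A minimal sketch of how the two locality groups arise in this kernel, with the original's double*** arrays flattened into fixed 3x3 blocks and every name beyond smvp's parameters assumed for illustration:

     /* Sketch only, not the SPEC source: fixed 3x3 blocks instead of
      * double***, illustrative names throughout. */
     void smvp_sketch(int nodes, double A[][3][3], const int *Acol,
                      const int *Aindex, double v[][3], double w[][3])
     {
         for (int i = 0; i < nodes; i++) {
             /* Index loads walk the sparse structure; when one misses in
              * L2, every instruction consuming its result joins a
              * miss-dependent instruction cluster. */
             for (int Anext = Aindex[i]; Anext < Aindex[i + 1]; Anext++) {
                 int col = Acol[Anext];      /* irregular, low-locality */
                 for (int r = 0; r < 3; r++) {
                     /* Once operands are near the core, this arithmetic
                      * forms a cache-dependent (high-locality) cluster. */
                     w[col][r] += A[Anext][r][0] * v[i][0]
                                + A[Anext][r][1] * v[i][1]
                                + A[Anext][r][2] * v[i][2];
                 }
             }
         }
     }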

  6. Mapping Clusters to Processors
     • An execution cluster is a partition of the dynamic DDG belonging to the same locality group.
     • High-locality clusters:
       - a large fraction of the instructions (70% in SpecFP, even more in SpecINT)
       - need OoO execution to tolerate L2 cache hit latencies
       - should advance as fast as possible (prefetching effect!)
       - thus the Cache Processor can be small, but must be out-of-order
     • Low-locality clusters:
       - a small fraction of the instructions (<30%)
       - generally not on the critical path (Karkhanis, WMPI'02)
       - thus the Memory Processor can be even smaller, and probably in-order

  7. A different view: D-KIP
     Distribution of instructions by decode->issue latency (SPEC FP 2000, measured in groups of 30 cycles): about 70% high locality, about 30% low locality. [Histogram bars: 58%, 20%, 15%, 6%.]
     Cache Processor (latency critical):
       - processes only cache-hit / register-dependent instructions
       - buffers few instructions (<100)
       - speculative / out-of-order
       - executes most of the control code
       - LD/ST intensive
       - memory lookahead (prefetching)
     Memory Processor (latency tolerant):
       - processes miss-dependent instructions
       - buffers thousands of instructions
       - relaxed scheduling
       - little control code -> few recoveries
       - few address calculations
       - no caches, no fetch/decode logic
     [Figure, "KILO-Instruction processor model": FETCH -> Cache Processor -> (miss-dependent insts) -> Memory Processor -> main memory; 2MB L2 cache; 400-cycle main-memory access latency.]
     DECOUPLED KILO-INSTRUCTION PROCESSOR (HPCA'06, PACT'07)
     Miquel Pericas et al., “A Decoupled KILO-Instruction Processor”, HPCA'06.
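
     A sketch of the steering this model implies, assuming miss-dependence propagates through register dataflow; the types and the poison-bit scheme are illustrative, not the HPCA'06 implementation:

     #include <stdbool.h>

     #define NUM_REGS 64

     /* Per-register "poison" bit: true if the latest producer of the
      * register is (transitively) dependent on an outstanding L2 miss. */
     static bool miss_dependent[NUM_REGS];

     typedef struct {
         int  src1, src2;   /* source register ids, -1 if unused      */
         int  dst;          /* destination register id, -1 if none    */
         bool is_load;
         bool l2_miss;      /* set by the cache hierarchy for loads   */
     } inst_t;

     /* Returns true if the instruction should be steered to the
      * (in-order) Memory Processor; false keeps it in the small OoO
      * Cache Processor window. */
     bool steer_to_memory_processor(const inst_t *in)
     {
         bool dep = (in->is_load && in->l2_miss)
                 || (in->src1 >= 0 && miss_dependent[in->src1])
                 || (in->src2 >= 0 && miss_dependent[in->src2]);
         if (in->dst >= 0)
             miss_dependent[in->dst] = dep;  /* propagate via dataflow */
         return dep;
     }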

  8. Flexible Heterogeneous MultiCore (I)
     [Figure: the instruction window, oldest to youngest, spans the Cache Processor's ROB plus a set of Memory Engines.]
     Cache Processor: a small out-of-order core, designed assuming a perfect L2; it processes all cache-dependent code.
     Memory Engines: small dual-issue in-order processors executing the code that depends on L2 cache misses. Each Memory Engine processes a portion of the instruction window. Inclusion of all loads/stores provides sequential memory semantics. Activating and de-activating Memory Engines changes the window size and allows control of power consumption.
     Miquel Pericas et al., “A Flexible Heterogeneous Multi-Core Architecture”, PACT'07.
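
     A small sketch of the window-size/power knob just described; the entry counts are assumptions, not PACT'07's parameters:

     /* Illustrative sizing only. */
     #define CP_ROB_ENTRIES   64   /* small OoO Cache Processor ROB    */
     #define ME_ENTRIES      128   /* window slice held by one engine  */
     #define MAX_ENGINES       8

     static int active_engines = 2;

     /* The effective in-flight window grows with each activated
      * Memory Engine. */
     int effective_window(void)
     {
         return CP_ROB_ENTRIES + active_engines * ME_ENTRIES;
     }

     /* De-activating engines trades window size for power. */
     void set_active_engines(int n)
     {
         if (n >= 0 && n <= MAX_ENGINES)
             active_engines = n;
     }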

  9. Flexible Heterogeneous MultiCore (II)
     Extension to multithreading: several Cache Processors, each with its own ROB, draw from a dynamically assigned pool of Memory Engines. The pool of Memory Engines can be shared for higher throughput/fairness.
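
     One plausible sharing policy (an assumption, not the paper's): grant a freed Memory Engine to the thread currently holding the fewest, a simple fairness heuristic:

     #define MAX_THREADS 4

     static int engines_of[MAX_THREADS];  /* MEs held per thread */

     /* Pick the thread with the fewest engines and account the grant. */
     int pick_thread_for_free_engine(int nthreads)
     {
         int best = 0;
         for (int t = 1; t < nthreads; t++)
             if (engines_of[t] < engines_of[best])
                 best = t;
         engines_of[best]++;
         return best;
     }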

  10. Kilo, Runahead and Prefetching
     • Prefetching
       - anticipates memory requests
       - reduces the impact of misses in the memory hierarchy
     • Runahead mechanism
       - executes speculative instructions under an LLC miss
       - prevents the processor from stalling when the ROB is full
       - generates useful data prefetches
     • Kilo-instruction processors
       - exploit more ILP by maintaining thousands of in-flight instructions while long-latency loads are outstanding in memory (implicit prefetching)
     Tanausú Ramírez et al., “Kilo-instruction Processors, Runahead and Prefetching”, CF'06, May 2006.
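
     A sketch of the runahead loop as described above, with every pipeline hook stubbed out as a hypothetical helper:

     #include <stdbool.h>

     /* Hypothetical hooks into a pipeline model, stubbed for illustration. */
     static bool llc_miss_at_rob_head(void)           { return false; }
     static bool miss_returned(void)                  { return true;  }
     static void checkpoint_architectural_state(void) {}
     static void mark_miss_dest_invalid(void)         {}
     static void execute_speculatively(void)          {}
     static void restore_checkpoint(void)             {}

     typedef enum { NORMAL, RUNAHEAD } run_mode_t;

     /* One simulated cycle of the runahead state machine. */
     void runahead_cycle(void)
     {
         static run_mode_t mode = NORMAL;

         if (mode == NORMAL && llc_miss_at_rob_head()) {
             checkpoint_architectural_state(); /* save regs + PC        */
             mark_miss_dest_invalid();         /* miss result is bogus  */
             mode = RUNAHEAD;                  /* don't stall; keep going */
         }

         if (mode == RUNAHEAD) {
             execute_speculatively();          /* invalid values propagate;
                                                  valid loads still issue
                                                  and act as prefetches  */
             if (miss_returned()) {
                 restore_checkpoint();         /* discard runahead work  */
                 mode = NORMAL;                /* resume from the load   */
             }
         }
     }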

  11. Performance versus Runahead and Stride Prefetching
     • The OoO and runahead baselines are 4-way, with 64/256-entry ROBs
     • Cache Processors are 4-way with a 64-entry ROB
     • Memory Processor / Memory Engines are 2-way in-order processors
     • A Memory Engine can hold up to 128 long-latency instructions and 128 loads/stores
     • Runahead features an ideal runahead cache
     • The stream prefetcher can hold up to 64KB of prefetched data

  12. “Kilo-processor” and multiprocessor systems
     [Chart: IPC (0-3.5) vs. ROB size (64, 128, 512, 1024, 2048) for FFT, RADIX, and LU, comparing the BADA, IDEAL NET, and IDEAL NET & MEM configurations.]
     M. Galluzzi et al., “A First Glance at Kilo-Instruction Based Multiprocessors”, invited paper, ACM Computing Frontiers Conference, Ischia, Italy, April 10-12, 2004.

  13. What we wanted to do
     • Can we extend a Big-Little multicore to implement the FMC?
     • Are the Memory Engines (MEs) used all the time, or are they waiting for long-latency loads?
     • Can we do something to avoid discarding all the MEs in case of branch mispredictions?
     • What does a practical kilo-vector processor look like?

  14. Some ideas “stolen” from “Edge Processors” and “Decoupled Architectures”
     [Figure: Cache Processors draw from a pool of waiting queues and a pool of MEs.]
     Split the functionality of the MEs between the instruction queues and the functional units.

  15. Waiting Queue
     • Holds instructions plus their logical registers
     • Waits until all used logical registers are ready
     • Then a Memory Engine is assigned
     [Figure: a waiting queue of instructions feeding a Memory Engine, together with its waiting input register file.]
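
     A sketch of such a queue, assuming logical registers are tracked as a ready bitmask; every structure and function name here is illustrative:

     #include <stdbool.h>
     #include <stdint.h>

     #define WQ_CAPACITY 64

     /* One decoded instruction parked in a waiting queue, tagged with
      * the logical source registers it still needs. */
     typedef struct {
         uint64_t inst_bits;     /* encoded instruction                */
         uint64_t needed_regs;   /* bitmask of logical source registers */
     } wq_entry_t;

     typedef struct {
         wq_entry_t entries[WQ_CAPACITY];
         int        count;
         uint64_t   pending;     /* union of needed_regs not yet ready */
     } waiting_queue_t;

     /* Called when a logical register becomes ready. Once nothing is
      * pending, the whole queue can be handed to a free Memory Engine
      * along with the input register file values it waited for. */
     bool wq_register_ready(waiting_queue_t *wq, int reg)
     {
         wq->pending &= ~(1ull << reg);
         return wq->pending == 0;   /* true -> assign a Memory Engine */
     }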

  16. Where to start a waiting queue
      [Diagram: a loop body containing an if/else (Loop: ... If: Br ... Else: ... Fi: ... Endloop), repeated over several iterations.]

  17. Where to start a waiting queue
      [Same diagram, second build; additional annotations: Else, If, and a definition of R1 (“R1<-”).]

  18. Where to start a waiting queue
      [Same diagram, third build; additional annotations: “New R1, R2” and “New R1”, marking where new waiting queues start as registers are redefined.]

  19. Problems and more problems
     • What to do if the addresses of loads and stores are modified?
     • Fetching instructions, and partial re-execution
     • Pointer chasing
     • Start a new waiting queue, or suspend the execution of a waiting queue?

  20. “Kilo-vector” processor
     [Figure: execution-time breakdown per program. Baseline program: 20 + 80 = 100. Vector: 20 + 8 = 28, speedup 3.5. Kilo-vector: 5 + 8 = 13, speedup 7.7.]
     F. Quintana et al., “Kilo-vector” processors, UPC-DAC.
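
     Reading the bars as cycle counts (an assumption about the figure), the quoted speedups follow directly:

     \[
     S_{\text{vector}} = \frac{20 + 80}{20 + 8} = \frac{100}{28} \approx 3.5,
     \qquad
     S_{\text{kilo}} = \frac{20 + 80}{5 + 8} = \frac{100}{13} \approx 7.7.
     \]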

  21. Adrian.cristal@bsc.es
