The Rise and the Fall of Scratch Pads
Aviral Shrivastava
Compiler Microarchitecture Labs (CML)
Arizona State University
Utopia of Caches
• Few other things affect programming as much as the memory architecture
  – Number of registers
  – Pipeline structure
  – Bypasses
• The illusion of a large, unified memory makes programming simple
  – Coherent caches
  – A variable has a unique address – its name
  – The cache gets the latest value of the variable from wherever it is in the memory
SPMs for Power, Perf., and Area
[Figure: energy per access (nJ) vs. memory size (256 to 16384 bytes) for a scratch pad and 2-way caches backing 1 MB, 16 MB, and 4 GB address spaces; a cache adds a tag array, tag comparators, and muxes to the data array and address decoder that make up an SPM]
• 40% less energy than a cache of the same size [Banakar02]
  – Absence of tag arrays, comparators, and muxes
• 34% less area than a cache of the same size [Banakar02]
  – Simple hardware design (only a memory array and address-decoding circuitry)
• Faster access to SPM than to cache
SPMs for Predictability
• In hard real-time systems, WCET analysis is essential
  – An application can be added only if all WCETs fit in the period
• Caches: estimating the number of misses in a program is at least doubly exponential
  – Presburger arithmetic with nested existential operators
  – In practice, analyses simply assume no cache – the WCET becomes very large
• With static data mapping on SPM
  – Tighter WCET – can fit more applications
Rise of SPMs
• The SuperH processors in Sega gaming consoles (e.g., the Sega Saturn) used SPMs
• Sony PlayStations have used SPMs extensively
  – PS1: could use SPM for stack data
  – PS2: 16 KB SPM
  – PS3: each SPU has a 256 KB SPM
• Intel network processors have used SPMs
• Graphics processing units (GPUs) use SPMs
  – Nvidia Tesla
• Many embedded processors used cache line locking
  – ColdFire MCF5249, PowerPC 440, MPC5554, ARM940, and ARM946E-S
• SPMs remained confined to embedded systems
A storm is brewing
• Power and temperature are becoming key design concerns
  – Multi-cores seem to be the solution
• Each core is smaller and slower, but the system still has high throughput
  – Perf_sys = n × Perf_core
  – Power_sys = n × Power_core
  – PE_sys = PE_core (power efficiency is unchanged)
• Throughput is the new metric
  – Throughput increases by a factor of n
  – Each core needs to be as low-power as possible
  – The energy hogs need to go away:
    • Out-of-order execution
    • Register renaming
    • Branch prediction
Era of disillusionment
• The illusion of unified memory is breaking
  – Cache coherency protocols do not scale beyond tens of cores
    • Tilera64 has a coherent cache architecture
    • The Intel 48-core and 80-core chips have non-coherent caches
  – Big push towards software coherency and TLM
    • Most of the time, coherence is not needed
    • Rarely-done things can be slower (Amdahl's law)
• The illusion of large memory is breaking
  – Reduce the automation of the cache
  – Software is exposed to the distributed-memory reality
    • SGI Altix: 320 GB RAM, half a million dollars
  – MPI-like communication is coming inside the core
    • Most of the time, a core can operate on local data
Limited Local Memory (LLM) Architecture
• A distributed-memory platform in which each core has its own small scratch pad memory
• Cores can access only their local memory
• Access to the global memory is through explicit DMA calls in the application program (see the sketch below)
• Example: IBM Cell Broadband Engine
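As a concrete illustration, here is a minimal sketch of an explicit DMA read on a Cell SPU using the spu_mfcio.h interface; the buffer name, tag, chunk size, and effective address are illustrative assumptions, not code from the talk.

    #include <spu_mfcio.h>

    #define TAG   0
    #define CHUNK 4096

    static char buf[CHUNK] __attribute__((aligned(128)));  /* local-store buffer */

    /* Pull 'size' bytes at effective address 'ea' in global memory
       into the local store, then wait for the transfer to finish. */
    void fetch_from_global(unsigned long long ea, unsigned int size)
    {
        mfc_get(buf, ea, size, TAG, 0, 0);   /* enqueue DMA: global -> local */
        mfc_write_tag_mask(1 << TAG);        /* select the tag group to wait on */
        mfc_read_tag_status_all();           /* block until the DMA completes */
    }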
LLM Programming Model

Main core:

    #include <libspe2.h>
    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        int speid, status;
        speid = spe_create_thread(&hello_spu);
        spe_wait(speid, &status);
        return 0;
    }

Each local core:

    #include <spu_mfcio.h>

    int main(speid, argp)
    {
        printf("Hello world!\n");
        return 0;
    }

• Extremely power-efficient execution if all code and application data fit in the local memory
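The main-core snippet follows the older libspe 1.x thread style even though it includes libspe2.h. Under the libspe2 API that header actually provides, a minimal equivalent would look roughly like the sketch below (spe_context_run blocks, so a multi-SPE program would run each context in its own pthread):

    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);  /* one SPE context */
        spe_program_load(ctx, &hello_spu);                    /* load the SPE binary */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);    /* runs until the SPE exits */
        spe_context_destroy(ctx);
        return 0;
    }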
Managing Data on Limited Local Memory
[Figure: local memory layout with the stack, heap, global, and code sections sharing one small address space]
• The LLM is shared by all data and code
  – No protection: all data needs to be managed
• Code
  – May not fit
• Stack
  – May grow and overwrite heap data, or code
• Heap
  – May grow and overwrite stack data
• IBM's fix:
  – Software cache
  – Do not use pointers
  – Do not use recursion
Using LLM is difficult

Original code:

    int global;
    f1(){
        int a, b;
        global = a + b;
        f2();
    }

Using SPM:

    int global;
    f1(){
        int a, b;
        glob = DSPM.fetch(global);
        glob = a + b;
        DSPM.writeback(glob, global);
        ISPM.fetch(f2);
        f2();
    }
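DSPM.fetch / ISPM.fetch is shorthand on the slide, not a real API. As a hedged sketch, such data-SPM calls could expand on a Cell SPU to something like the helpers below; the helper names, the tag, and the fixed local slot are hypothetical, and 'global_ea' (the variable's address in global memory) would have to be handed to the SPU by the main core.

    #include <spu_mfcio.h>

    #define TAG 1
    static int glob;   /* local-store home for 'global' while f1 runs */

    /* Hypothetical helpers a compiler could emit for DSPM.fetch
       and DSPM.writeback. */
    static void dspm_fetch(unsigned long long global_ea)
    {
        mfc_get(&glob, global_ea, sizeof(glob), TAG, 0, 0);  /* global -> local */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                           /* wait for copy-in */
    }

    static void dspm_writeback(unsigned long long global_ea)
    {
        mfc_put(&glob, global_ea, sizeof(glob), TAG, 0, 0);  /* local -> global */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                           /* wait for copy-out */
    }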
LLM different than SPM
[Figure: an SPU whose LLM reaches global memory only by DMA, vs. an ARM with an SPM alongside a cache to global memory]

SPU LLM architecture:
• Programs will not work without the LLM
  – The SPM is essential for execution
  – Need to make its use more efficient
• "What to place in SPM?" – everything
• "Where to place in SPM?"

ARM SPM architecture:
• Programs work without using the SPM
  – The SPM is for optimization, by placing frequently used data in it
• "What to place in SPM?" – can be more than the SPM size
• "Where to place in SPM?"
Outline
0. Global
1. Code management
   • [HIPC 2008] SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
   • [ASAP 2010] Dynamic Code Mapping for Limited Local Memory Systems
Code Management Mechanism
[Figure: (a) application call graph over F1, F2, F3; (c) local memory holding the code regions alongside the stack, heap, and global variables; (d) main memory holding the code of all functions]

(b) Linker script:

    SECTIONS {
        OVERLAY {
            F1.o
            F3.o
        }
        OVERLAY {
            F2.o
        }
    }
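The script above is schematic. In GNU ld's actual OVERLAY syntax, each OVERLAY statement gives its member sections the same run address, so F1 and F3 overlay each other in one region while F2 occupies another; the run and load addresses below are illustrative assumptions.

    SECTIONS {
        /* Region 0: F1 and F3 share one run address in local memory. */
        OVERLAY 0x1000 : AT (0x80000000)
        {
            .f1 { F1.o(.text) }
            .f3 { F3.o(.text) }
        }
        /* Region 1: F2 alone, at the next local address. */
        OVERLAY 0x2000 : AT (0x80004000)
        {
            .f2 { F2.o(.text) }
        }
    }

ld also defines __load_start_f1 / __load_stop_f1 style symbols that an overlay manager can use to copy the right function into its region before calling it.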
Code Management Problem
[Figure: the code section of the local memory divided into several regions]
• Choose the number of regions and the function-to-region mapping
  – Two extreme cases
• Code management is NP-complete
  – Minimum data transfer with a given space
Capturing call pattern

(a) Example application:

    F1(){
        F2();
        F3();
    }
    F2(){
        for(i=0;i<10;i++){
            F4();
        }
        for(i=0;i<100;i++){
            F5();
        }
    }
    F3(){
        for(i=0;i<10;i++){
            F6();
            F7();
        }
    }

[Figure: (b) call graph – F1 (1 KB) calls F2 and F3 (1 KB each) once; F2 calls F4 10 times and F5 100 times; F3 calls F6 and F7 (1 KB each) 10 times. (c) GCCFG – the same graph with explicit loop nodes L1, L2, L3, capturing the execution order]
FMUM Heuristic
[Figure: (a) start – each function in its own region: F1 1 KB, F2 1.5 KB, F3 0.5 KB, F4 2 KB, F5 1.5 KB, F6 1 KB, totalling the maximum of 7.5 KB; (b)–(c) regions are merged, e.g. F1 and F5 into one 1.5 KB region, until the layout fits the given 5.5 KB]
A sketch of this merge loop follows below.
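The following is a compact sketch of the FMUM idea as the figure suggests it: start from the maximum layout (one region per function) and greedily merge regions until the total fits the budget. A region holding several functions is as large as its largest member, since they overlay each other. The pair-selection rule here (largest space saving) is a placeholder assumption; the real heuristic also weighs interference costs from the GCCFG.

    #include <stdio.h>

    /* Greedy FMUM-style merge; mutates region[] in place. */
    int fmum(int region[], int n, int budget)
    {
        int total = 0, i;
        for (i = 0; i < n; i++) total += region[i];

        while (total > budget && n > 1) {
            int a = 0, b = 1, best_gain = -1, x, y;
            for (x = 0; x < n; x++)
                for (y = x + 1; y < n; y++) {
                    /* merging x and y saves min(region[x], region[y]) bytes */
                    int gain = region[x] < region[y] ? region[x] : region[y];
                    if (gain > best_gain) { best_gain = gain; a = x; b = y; }
                }
            if (region[a] < region[b]) region[a] = region[b];  /* merged size = max */
            region[b] = region[--n];                           /* drop region b */
            total -= best_gain;
        }
        return total <= budget;   /* 1 if the layout now fits */
    }

    int main(void)
    {
        /* the slide's example: F1..F6, sizes summing to 7.5 KB */
        int sizes[] = { 1024, 1536, 512, 2048, 1536, 1024 };
        printf("fits: %d\n", fmum(sizes, 6, 5632));   /* 5.5 KB budget */
        return 0;
    }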
FMUP Heuristic
[Figure: (a) start – all functions packed into the minimum layout (2 KB); (b)–(e) new regions (1.5 KB each here) are added step by step and functions are redistributed among them, until the given size (5 KB) is used]
Typical Performance Result
[Figure: total execution cycles (millions) for Stringsearch vs. the given code space in the limited local memory, from 800 to 3792 bytes; the annotated regions show FMUM performing better at some sizes and FMUP at others, so neither heuristic dominates]
Outline
0. Global
1. Code management
2. Stack management
   • [ASPDAC 2009] A Software Solution for Dynamic Stack Management on Scratch Pad Memory
Circular Stack Management
Stack size in local memory = 70 bytes

    Function   Frame size (bytes)
    F1         50
    F2         20
    F3         30

[Figure, step 1: F1 (50 B) and F2 (20 B) fill the 70-byte stack region in local memory; when F3 (30 B) is called, it does not fit]
[Figure, step 2: the oldest frames F1 and F2 are evicted to main memory (tracked by MemPtr), and F3 is placed at the base of the region (tracked by SP)]
A sketch of the per-call check follows below.
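As a minimal sketch of the management the figure animates: before each call, the runtime checks whether the callee's frame fits in what is left of the SPM stack region, and if not, evicts the oldest frames to main memory. The helper name, the "restart at base" policy, and the check itself are assumptions for illustration; the actual eviction would go through DMA as on the earlier slides.

    #define STACK_REGION_SIZE 70   /* SPM bytes reserved for the stack, per the example */

    extern void evict_oldest_frames(unsigned int bytes);  /* assumed: DMA old frames out */

    static unsigned int sp_offset = 0;   /* next free byte in the SPM stack region */

    /* Hypothetical check emitted before each function call. */
    void stack_check_in(unsigned int frame_size)
    {
        if (sp_offset + frame_size > STACK_REGION_SIZE) {
            evict_oldest_frames(frame_size);   /* e.g. F1 and F2 make way for F3 */
            sp_offset = 0;                     /* simplest policy: restart at the base */
        }
        sp_offset += frame_size;               /* the new frame now lives in the SPM */
    }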
How to evict data to global memory?
• Can use DMA to transfer the heap object to global memory
  – DMA is very fast – no core-to-core communication
  – But eventually, you can overwrite some other data
  – Need OS mediation: route malloc requests from the execution cores through the main core
• But thread communication between cores is slow!
[Figure: execution cores either DMA directly into global memory or send malloc requests to the main core]
Hybrid DMA + Communication
• Each execution core gets a pre-allocated window [startAddr, endAddr) in global memory
• To evict S bytes, the thread on the execution core checks the window first (a C sketch follows below):

    if (enough space in global memory, i.e. ≥ S bytes left in the window)
        write the data using DMA
    else
        request more space in global memory (mailbox-based communication with the main core)
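The decision above, as a hedged C sketch for a Cell SPU: spill_ptr/spill_end stand in for the slide's startAddr/endAddr bookkeeping, and request_more_space() is an assumed mailbox round-trip to the main core, not a real library call.

    #include <spu_mfcio.h>

    #define TAG 2

    static unsigned long long spill_ptr;  /* next free byte in the granted window */
    static unsigned long long spill_end;  /* end of the granted window (endAddr) */

    extern unsigned long long request_more_space(unsigned int bytes); /* mailbox RPC, assumed */

    /* Evict 'size' bytes at local address 'ls' to global memory.
       Fast path: plain DMA into the pre-granted window.
       Slow path: ask the main core to grow the window first (slow!). */
    void evict_to_global(void *ls, unsigned int size)
    {
        if (spill_ptr + size > spill_end)           /* fewer than 'size' bytes left */
            spill_end = request_more_space(size);   /* OS/main-core mediation */

        mfc_put(ls, spill_ptr, size, TAG, 0, 0);    /* DMA: local -> global */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                  /* wait for completion */
        spill_ptr += size;
    }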