region caching motivation region caching motivation
play

Region Caching: Motivation Region Caching: Motivation High Level - PDF document

Gary Tyson Region Caching: Motivation Region Caching: Motivation High Level Languages influence the memory reference behavior f b h i Caused by translating complex semantics into simple code (somewhat application independent)


  1. Gary Tyson Region Caching: Motivation Region Caching: Motivation  High Level Languages influence the memory reference behavior f b h i – Caused by translating complex semantics into simple code (somewhat application independent)  Programming conventions also predictably influence memory reference behavior  Exploiting memory region characteristics can p g y g lead to more effective caches  Attacking each region individually – finding optimal cache designs for each region 1/45 Memory Space Partitioning Memory Space Partitioning max mem  Based on programming reserved language Dynamic Data Region  Non-overlapped Protected subdivisions Dynamic Data Region Static Data Region Code Region Static Data Region reserved min mem MIPS Architecture 2/45 1

  2. Gary Tyson Memory Space Partitioning Memory Space Partitioning max mem  Based on programming reserved language Dynamic Data Region  Non-overlapped Protected subdivisions Dynamic Data Region  Split code and data  I cache & D cache I-cache & D-cache Static Data Region Code Region Static Data Region reserved min mem MIPS Architecture 3/45 Memory Space Partitioning Memory Space Partitioning max mem  Based on programming reserved language Stack grows downward  Non-overlapped Protected subdivisions Heap grows upward  Split code and data  I- cache & D cache cache & D-cache Static Global Data Region  Split data into regions Code Region – Stack (  ) Read-only data – Heap (  ) reserved min mem MIPS Architecture – Global (static) – Read-only (static) 4/45 2

  3. Gary Tyson Stack Reference of Memory Instructions Stack Reference of Memory Instructions 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% c p f 2 y n p r f x k r g c l p p t o a c i e o e m v f z m g s v i a e g g w t A z r b r r b a t o c l v r p e p 5/45 Stack + Global Stack + Global 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% y c p f 2 n p r f x k r g c l p p t o a c i e o e m v f z m g s w v i a e g g t A z r b r r b a t o l c v r p e p 6/45 3

  4. Gary Tyson Stack + Global + Heap Stack + Global + Heap 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% c p f 2 y n p r f x k r g c l p p t o a c i e o e m v f z m g s v i a e g g w t A z r b r r b a t o c l v r p e p 7/45 Stack + Global + Heap + Read- Stack + Global + Heap + Read - only Data only Data 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% y c p f 2 n p r f x k r g c l p p t o a c i e o e m v f z m g s w v i a e g g t A z r b r r b a t o l c v r p e p 8/45 4

  5. Gary Tyson Region Region- -based Partitioning based Partitioning  Run-time virtual memory address space is partitioned by programming languages.  Can we reduce power while retaining performance ?  Reference patterns and characteristics of these data are different. 9/45 Miss Rate by Memory Region Miss Rate by Memory Region 0.3000 0.2500 0.2000 H eap data miss rate 0.1500 G lobal static data Stack data 0.1000 0.0500 0.0000 256 512 1k 2k 4k 8k 16k 32k 64k Individual C ache Size  Stack data level off quickly; so do global data  Heap drops linearly every time cache size doubled 10/45 5

  6. Gary Tyson Region Region- -based Cachelets (RBC) based Cachelets (RBC)  A simple idea p address  A horizontal region select Address DeMultiplexer partitioning  Clock gated caches cs cs cs RC1 RC2 stack static  Only enable (cycle) L1 Cache the region cachelet the region cachelet being accessed  Redirect >70% Processor Data Bus accesses to smaller region cachelets 11/45 Power Reduction of RBC Power Reduction of RBC 1.00 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 g g g g e e o o e e e e e e s n o p p a c c g g e e e e d d d d i i i i d d d d d d d d d d d d g g e e a a t t i i i i v v d d m m s s p p p p p p p p o o o o d d g g A A o o o o o o o o m m a e u u x e e j j c c c c c c c c r c d a a d p n n e n e n e n e e d i u e d c e d d d t s e e . m 2 2 w w a o 1 1 p p t t i i . g g a a 2 2 g g w w s a . e a e e r r 7 7 p p g g m s s p p g g e e m m e e m m p p S4k-G4k-32kL1 vs. 32k-DM S4k-G4k-32kL1 vs. 32k-4way S4k-G4k-32kL1 vs. 40k-5way  Dynamic cache power reduced by as much as 63% 12/45 6

  7. Gary Tyson Overview of Talk Overview of Talk  Some of our prior work in this area  Some of our prior work in this area – Region Based Cache Design – Stack Value File  Current Research – Micro-architectural support for Java VM Micro architectural support for Java VM – Application Specific Processor Design – VM improvement (security; linkage; debug) 13/45 Improving Execution Time Improving Execution Time  Region Caches exploit differences in  Region Caches exploit differences in working set size for Stack, Static and Heap regions to reduce power without hurting performance.  Our Stack Value File research exploits specific stack reference characteristics to improve execution performance by developing a new data storage structure. 14/45 7

  8. Gary Tyson Morphing $sp Morphing $sp- -relative References relative References  Morph $sp-relative references into register accesses  Use a Stack Value File (SVF)  Resolve address early in decode stage for stack-pointer indexed accesses stack pointer indexed accesses  Resolve stack memory dependency early  Aliased references are re-routed to SVF 15/45 Microarchitecture Extension Microarchitecture Extension (PIII (PIII- -like) like) Stack Value File accesses occur at the same pipeline stage as register reads same pipeline stage as register reads Issue Execute Commit Reservation Station / L MOB Ld/St Dispatch Fetch Decode Unit DecoderQ Reg Decoder Instr-Cache Renamer Func Unit (RAT) LSQ Morphing Pre-Decode offset ArchRF Max Hash ReOrder Buffer SP Stack SP Value File interlock 16/45 8

  9. Gary Tyson Stack Reference Characteristics Stack Reference Characteristics  Contiguity – Good temporal and spatial locality – Can be stored in a simple, fast structure • Small die area relative to a regular cache • Less power dissipation – No address tag need for each datum • Keep the current TOS address 17/45 Cache Distribution by Region Cache Distribution by Region 1.E+07 set 1.E+06 # of hits to each s 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 1.E+00 1 512 stack global static heap 18/45 9

  10. Gary Tyson Stack Reference Characteristics Stack Reference Characteristics  First touch is almost always a Store Store – Avoid waste bandwidth to bring in dead data – “ Write Validate ” allocation policy  Deallocated stack frame – Dead data (must be written to next D d d t ( t b itt t t reference) – No need to write them back to memory 19/45 Memory Traffic Memory Traffic  SVF dramatically reduces memory traffic by many orders of magnitude. – For gcc, ~28M (Stack cache  L2) reduced to ~86K (SVF  L1).  Incoming traffic is eliminated because SVF does not allocate a cache line on a SVF does not allocate a cache line on a miss.  Outgoing traffic consists of only those words words that are dirty when evicted (instead of entire cache lines). 20/45 10

  11. Gary Tyson Speedup Potential of Stack Speedup Potential of Stack Value File Value File 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.1 1.0 bzip2 b i 2 crafty ft eon gap gcc gzip i mcf f parser t twolf lf vortex t perlbmk lb k vpr A Avg 4-wide 8-wide 16-wide 16-wide (gshare)  Assume all references can be morphed  ~30% speedup for a 16-wide with a dual- ported L1 21/45 Why is SVF Faster ? Why is SVF Faster ?  It reduces the load-to-use latency of stack references  It effectively increases the number of memory port by rerouting more than ½ of all memory references to the SVF  It reduces contention in the MOB  More flexibility in renaming stack references 22/45 11

  12. Gary Tyson MemoryLogix MLX1 Processor MemoryLogix MLX1 Processor Taken from: Peter Song Peter Song “MLX1: A Tiny Multithreaded 586 Core for Smart Mobile Devices”. 23/45 Conclusions Conclusions  Microarchitects need to develop new tradeoffs to  Microarchitects need to develop new tradeoffs to achieve required design goals.  By exploiting characteristics of the programming environment, it is possible to design more efficient microarchitectures.  For many embedded applications further improvement can be made by designing custom processor for each can be made by designing custom processor for each application.  Ultimately, success will enable more cycles to be used to solve tough software engineering problems. 24/45 12

  13. Gary Tyson Backup Foils Backup Foils Baseline Microarchitecture Baseline Microarchitecture Issue Execute Commit Reservation Station / L MOB Ld/St Dispatch Fetch Decode Unit DecoderQ Reg Decoder Instr-Cache Renamer Func Unit (RAT) LSQ ArchRF ReOrder Buffer 26/45 13

Recommend


More recommend