Gary Tyson Region Caching: Motivation Region Caching: Motivation High Level Languages influence the memory reference behavior f b h i – Caused by translating complex semantics into simple code (somewhat application independent) Programming conventions also predictably influence memory reference behavior Exploiting memory region characteristics can p g y g lead to more effective caches Attacking each region individually – finding optimal cache designs for each region 1/45 Memory Space Partitioning Memory Space Partitioning max mem Based on programming reserved language Dynamic Data Region Non-overlapped Protected subdivisions Dynamic Data Region Static Data Region Code Region Static Data Region reserved min mem MIPS Architecture 2/45 1
Gary Tyson Memory Space Partitioning Memory Space Partitioning max mem Based on programming reserved language Dynamic Data Region Non-overlapped Protected subdivisions Dynamic Data Region Split code and data I cache & D cache I-cache & D-cache Static Data Region Code Region Static Data Region reserved min mem MIPS Architecture 3/45 Memory Space Partitioning Memory Space Partitioning max mem Based on programming reserved language Stack grows downward Non-overlapped Protected subdivisions Heap grows upward Split code and data I- cache & D cache cache & D-cache Static Global Data Region Split data into regions Code Region – Stack ( ) Read-only data – Heap ( ) reserved min mem MIPS Architecture – Global (static) – Read-only (static) 4/45 2
Gary Tyson Stack Reference of Memory Instructions Stack Reference of Memory Instructions 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% c p f 2 y n p r f x k r g c l p p t o a c i e o e m v f z m g s v i a e g g w t A z r b r r b a t o c l v r p e p 5/45 Stack + Global Stack + Global 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% y c p f 2 n p r f x k r g c l p p t o a c i e o e m v f z m g s w v i a e g g t A z r b r r b a t o l c v r p e p 6/45 3
Gary Tyson Stack + Global + Heap Stack + Global + Heap 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% c p f 2 y n p r f x k r g c l p p t o a c i e o e m v f z m g s v i a e g g w t A z r b r r b a t o c l v r p e p 7/45 Stack + Global + Heap + Read- Stack + Global + Heap + Read - only Data only Data 100% 90% 80% 70% Read-only 60% Heap ref 50% Static ref 40% Stack ref 30% 20% 10% 0% y c p f 2 n p r f x k r g c l p p t o a c i e o e m v f z m g s w v i a e g g t A z r b r r b a t o l c v r p e p 8/45 4
Gary Tyson Region Region- -based Partitioning based Partitioning Run-time virtual memory address space is partitioned by programming languages. Can we reduce power while retaining performance ? Reference patterns and characteristics of these data are different. 9/45 Miss Rate by Memory Region Miss Rate by Memory Region 0.3000 0.2500 0.2000 H eap data miss rate 0.1500 G lobal static data Stack data 0.1000 0.0500 0.0000 256 512 1k 2k 4k 8k 16k 32k 64k Individual C ache Size Stack data level off quickly; so do global data Heap drops linearly every time cache size doubled 10/45 5
Gary Tyson Region Region- -based Cachelets (RBC) based Cachelets (RBC) A simple idea p address A horizontal region select Address DeMultiplexer partitioning Clock gated caches cs cs cs RC1 RC2 stack static Only enable (cycle) L1 Cache the region cachelet the region cachelet being accessed Redirect >70% Processor Data Bus accesses to smaller region cachelets 11/45 Power Reduction of RBC Power Reduction of RBC 1.00 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 g g g g e e o o e e e e e e s n o p p a c c g g e e e e d d d d i i i i d d d d d d d d d d d d g g e e a a t t i i i i v v d d m m s s p p p p p p p p o o o o d d g g A A o o o o o o o o m m a e u u x e e j j c c c c c c c c r c d a a d p n n e n e n e n e e d i u e d c e d d d t s e e . m 2 2 w w a o 1 1 p p t t i i . g g a a 2 2 g g w w s a . e a e e r r 7 7 p p g g m s s p p g g e e m m e e m m p p S4k-G4k-32kL1 vs. 32k-DM S4k-G4k-32kL1 vs. 32k-4way S4k-G4k-32kL1 vs. 40k-5way Dynamic cache power reduced by as much as 63% 12/45 6
Gary Tyson Overview of Talk Overview of Talk Some of our prior work in this area Some of our prior work in this area – Region Based Cache Design – Stack Value File Current Research – Micro-architectural support for Java VM Micro architectural support for Java VM – Application Specific Processor Design – VM improvement (security; linkage; debug) 13/45 Improving Execution Time Improving Execution Time Region Caches exploit differences in Region Caches exploit differences in working set size for Stack, Static and Heap regions to reduce power without hurting performance. Our Stack Value File research exploits specific stack reference characteristics to improve execution performance by developing a new data storage structure. 14/45 7
Gary Tyson Morphing $sp Morphing $sp- -relative References relative References Morph $sp-relative references into register accesses Use a Stack Value File (SVF) Resolve address early in decode stage for stack-pointer indexed accesses stack pointer indexed accesses Resolve stack memory dependency early Aliased references are re-routed to SVF 15/45 Microarchitecture Extension Microarchitecture Extension (PIII (PIII- -like) like) Stack Value File accesses occur at the same pipeline stage as register reads same pipeline stage as register reads Issue Execute Commit Reservation Station / L MOB Ld/St Dispatch Fetch Decode Unit DecoderQ Reg Decoder Instr-Cache Renamer Func Unit (RAT) LSQ Morphing Pre-Decode offset ArchRF Max Hash ReOrder Buffer SP Stack SP Value File interlock 16/45 8
Gary Tyson Stack Reference Characteristics Stack Reference Characteristics Contiguity – Good temporal and spatial locality – Can be stored in a simple, fast structure • Small die area relative to a regular cache • Less power dissipation – No address tag need for each datum • Keep the current TOS address 17/45 Cache Distribution by Region Cache Distribution by Region 1.E+07 set 1.E+06 # of hits to each s 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 1.E+00 1 512 stack global static heap 18/45 9
Gary Tyson Stack Reference Characteristics Stack Reference Characteristics First touch is almost always a Store Store – Avoid waste bandwidth to bring in dead data – “ Write Validate ” allocation policy Deallocated stack frame – Dead data (must be written to next D d d t ( t b itt t t reference) – No need to write them back to memory 19/45 Memory Traffic Memory Traffic SVF dramatically reduces memory traffic by many orders of magnitude. – For gcc, ~28M (Stack cache L2) reduced to ~86K (SVF L1). Incoming traffic is eliminated because SVF does not allocate a cache line on a SVF does not allocate a cache line on a miss. Outgoing traffic consists of only those words words that are dirty when evicted (instead of entire cache lines). 20/45 10
Gary Tyson Speedup Potential of Stack Speedup Potential of Stack Value File Value File 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.1 1.0 bzip2 b i 2 crafty ft eon gap gcc gzip i mcf f parser t twolf lf vortex t perlbmk lb k vpr A Avg 4-wide 8-wide 16-wide 16-wide (gshare) Assume all references can be morphed ~30% speedup for a 16-wide with a dual- ported L1 21/45 Why is SVF Faster ? Why is SVF Faster ? It reduces the load-to-use latency of stack references It effectively increases the number of memory port by rerouting more than ½ of all memory references to the SVF It reduces contention in the MOB More flexibility in renaming stack references 22/45 11
Gary Tyson MemoryLogix MLX1 Processor MemoryLogix MLX1 Processor Taken from: Peter Song Peter Song “MLX1: A Tiny Multithreaded 586 Core for Smart Mobile Devices”. 23/45 Conclusions Conclusions Microarchitects need to develop new tradeoffs to Microarchitects need to develop new tradeoffs to achieve required design goals. By exploiting characteristics of the programming environment, it is possible to design more efficient microarchitectures. For many embedded applications further improvement can be made by designing custom processor for each can be made by designing custom processor for each application. Ultimately, success will enable more cycles to be used to solve tough software engineering problems. 24/45 12
Gary Tyson Backup Foils Backup Foils Baseline Microarchitecture Baseline Microarchitecture Issue Execute Commit Reservation Station / L MOB Ld/St Dispatch Fetch Decode Unit DecoderQ Reg Decoder Instr-Cache Renamer Func Unit (RAT) LSQ ArchRF ReOrder Buffer 26/45 13
Recommend
More recommend