  1. Analytical Cache Models with Applications to Cache Partitioning
     G. Edward Suh, Srinivas Devadas, and Larry Rudolph (LCS, MIT)

  2. Motivation
     - Memory system performance is critical
     - Everyone thinks about their own application
     - But modern computer systems execute multiple applications concurrently/simultaneously
       - Context switches cause cold misses
       - Simultaneous applications compete for cache space
     - Caches should be managed more carefully, considering multiple processes
       - Explicit management of cache space => partitioning
       - Cache-aware job schedulers

  3. Related Work
     - Analytical cache models
       - Thiébaut and Stone (1987)
       - Agarwal, Horowitz and Hennessy (1989)
       - Both focus only on long time quanta
       - Inputs are hard to obtain on-line
     - Cache partitioning
       - Stone, Turek and Wolf (1992): optimal cache partitioning for very short time quanta
     - Our model & partitioning
       - Works for any time quantum
       - Inputs are easier to obtain (possible to estimate on-line)

  4. Our Multi-tasking Cache Model
     - Inputs
       - C: cache size
       - Schedule: job sequences with time quantum (T_A)
       - M_A(x): miss-rate as a function of cache size for Process A
     - Output
       - Overall miss-rate (OMR) for multi-tasking
     [Diagram: the per-process miss-rate curves, the cache size, and the schedule feed the cache model, which outputs the overall miss-rate]
     (A sketch of this interface follows below.)
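To make the inputs and output concrete, here is a minimal Python sketch of the interface described on this slide; the field names and example values are my own illustration, not code or data from the paper.

```python
# Inputs of the multi-tasking cache model (slide 4): per-process miss-rate
# curves M_p(x), the cache size C, and the schedule with its time quanta.
# The output of the model would be the overall miss-rate (OMR).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CacheModelInput:
    cache_size: int                                         # C, in cache blocks
    schedule: List[str]                                     # e.g. ["A", "B", "A", "B", ...]
    quanta: Dict[str, int]                                  # T_p per process, in memory references
    miss_rate_curves: Dict[str, Callable[[float], float]]   # M_p(x)

if __name__ == "__main__":
    example = CacheModelInput(
        cache_size=1024,
        schedule=["A", "B"],
        quanta={"A": 10000, "B": 10000},
        # Hypothetical miss-rate curves, invented for illustration.
        miss_rate_curves={p: (lambda x: 1.0 / (1.0 + x / 64.0)) for p in "AB"},
    )
    print(example.cache_size, example.quanta)
```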

  5. Assumptions
     - The miss-rate of a process is a function of cache size alone, not time
       - One MR(size) curve per application, averaged over the application's lifetime
       - In cases of high variance, split the application into phases, with one MR(size) curve per phase
       - Generated off-line (or on-line with HW support)
     - No shared memory space among processes

  6. Assumptions: Cont.
     - Fully-associative caches
       - Extended to set-associative caches (memo 433)
       - The fully-associative model works for set-associative cache partitioning
     - LRU replacement policy
     - Time measured in number of memory references
       - The number of memory references can easily be converted to real time in a steady state

  7. Independent Footprint x_A^Φ(t)
     - Independent footprint: the amount of data for Process A at time t, starting from an empty cache; x_A^Φ(0) = 0
     - Assume only one process executes
     - Changes on each memory reference:
       - If hit:  x_A^Φ(t+1) = x_A^Φ(t)
       - If miss: x_A^Φ(t+1) = MIN[x_A^Φ(t) + 1, C]
     - Approximating the real value of x_A^Φ(t) by its expectation:
       E[x_A^Φ(t+1)] = MIN[E[x_A^Φ(t)] + P_A(t), C]
                     = MIN[E[x_A^Φ(t)] + M_A(E[x_A^Φ(t)]), C]
     (A code sketch of this recurrence follows below.)
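A minimal sketch of the expected-footprint recurrence above, assuming the miss-rate curve is available as a Python callable; the example curve is invented for illustration.

```python
# Iterate E[x(t+1)] = min(E[x(t)] + M(E[x(t)]), C), starting from an empty cache.

def independent_footprint(miss_rate, cache_size, num_refs):
    """Return E[x^Phi(t)] for t = 0..num_refs."""
    x = [0.0]
    for _ in range(num_refs):
        # The probability of a miss at time t is the miss rate at the current footprint.
        p_miss = miss_rate(x[-1])
        x.append(min(x[-1] + p_miss, cache_size))
    return x

if __name__ == "__main__":
    # Hypothetical miss-rate curve: misses become rarer as the footprint grows.
    curve = lambda size: 1.0 / (1.0 + size / 64.0)
    print(independent_footprint(curve, cache_size=1024, num_refs=10)[:5])
```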

  8. Dependent Footprint x_A(t)
     - Dependent footprint: the amount of data for Process A when multiple processes execute concurrently
     - Obtained from the given schedule and the independent footprints of all processes
     - Example
       - Four processes: A, B, C, D
       - Round-robin schedule: ABCDABCD...

  9. Dependent Footprint x_A(t): Cont.
     [Figure: LRU stack of an infinite-size cache after Process A has executed for time t; from MRU to LRU the blocks are A_0, D_-1, C_-1, B_-1, A_-1, D_-2, C_-2, B_-2, A_-2, D_-3, C_-3, ...; block A_0 has size x_A^Φ(t) and block A_-1 has size x_A^Φ(t+T_A) - x_A^Φ(t)]
     - Compute the block sizes from the independent footprint x^Φ(t) of each process
     - Stack blocks from MRU to LRU (A_0, D_-1, C_-1, B_-1, A_-1, D_-2, ...) until the cache is full

  10. Dependent Footprint x_A(t): Cont.
     [Same LRU-stack figure, with the cache size C marked on the stack]
     - Case 1: a dormant process' block is the LRU block
       - x_A(t) = A_0 + A_-1 = x_A^Φ(t + T_A)

  11. Dependent Footprint x_A(t): Cont.
     [Same LRU-stack figure, with the cache size C marked on the stack]
     - Case 1: a dormant process' block is the LRU block
       - x_A(t) = A_0 + A_-1 = x_A^Φ(t + T_A)
     - Case 2: the active process' own block is the LRU block
       - x_A(t) = C - (D_0 + C_0 + B_0 + D_-1 + C_-1 + B_-1) = C - x_D^Φ(T_D) - x_C^Φ(T_C) - x_B^Φ(T_B)
     (A sketch combining the two cases follows below.)
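A minimal sketch of how the two dependent-footprint cases might be combined for the round-robin A, B, C, D example, assuming each process' independent footprint is available as a callable; taking the smaller of the two bounds as the binding one is my simplification, and the curves and quanta below are placeholders, not data from the paper.

```python
def dependent_footprint_A(t, fp, quanta, cache_size):
    """x_A(t): Process A's data in the cache at time t into its quantum."""
    # Case 1: a dormant process' block is the LRU block, so Process A keeps
    # both its current-quantum and previous-quantum data.
    case1 = fp["A"](t + quanta["A"])
    # Case 2: Process A's own block is the LRU block, so A gets whatever space
    # is left after the other processes' most recent footprints.
    others = sum(fp[p](quanta[p]) for p in ("B", "C", "D"))
    case2 = cache_size - others
    # Under this simplified reading, the smaller of the two bounds applies.
    return min(case1, max(case2, 0.0))

if __name__ == "__main__":
    # Hypothetical footprint curves and equal 10,000-reference quanta.
    fp = {p: (lambda t, s=s: min(t * s, 4096.0))
          for p, s in zip("ABCD", (0.05, 0.04, 0.03, 0.02))}
    quanta = {p: 10000 for p in "ABCD"}
    print(dependent_footprint_A(5000, fp, quanta, cache_size=8192))
```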

  12. Computing the Miss Probability P_A(t)
     - Effective cache size
       - x_A(t): the amount of data in the cache for Process A at time t
     - The probability to miss at time t: P_A(t) = M_A(x_A(t)) (see the interpolation sketch below)
     [Figure: the cache at time t holds Process A's data (x_A(t)) plus other processes' data; evaluating the miss-rate curve M_A(x) at x_A(t) gives P_A(t)]
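A minimal sketch of evaluating P_A(t) = M_A(x_A(t)), assuming the miss-rate curve is supplied as sample points and linearly interpolated; the sample values below are invented for illustration.

```python
import bisect

def miss_probability(x_a, sizes, rates):
    """Linearly interpolate the miss-rate curve M_A at footprint x_a."""
    if x_a <= sizes[0]:
        return rates[0]
    if x_a >= sizes[-1]:
        return rates[-1]
    i = bisect.bisect_right(sizes, x_a)
    frac = (x_a - sizes[i - 1]) / (sizes[i] - sizes[i - 1])
    return rates[i - 1] + frac * (rates[i] - rates[i - 1])

if __name__ == "__main__":
    sizes = [0, 1024, 2048, 4096, 8192]        # cache sizes (hypothetical)
    rates = [1.0, 0.20, 0.08, 0.04, 0.03]      # miss rates (hypothetical)
    print(miss_probability(1536.0, sizes, rates))  # P_A(t) at x_A(t) = 1536
```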

  13. Estimating the Miss-Rate
     - Miss-rate of Process A
       - In a steady state, all time quanta of Process A are identical
       - Time starts (t = 0) at the beginning of a time quantum
       - Integrating P_A(t) over one quantum gives the expected number of misses; dividing by T_A gives the miss-rate:
         mr_A = (1/T_A) ∫[0, T_A] P_A(t) dt
     - Overall miss-rate (OMR)
       - Weighted sum of each process' miss-rate
     (A numerical version of this integral is sketched below.)
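A minimal sketch of the integral above evaluated as a discrete sum over the memory references of one quantum; the miss-rate curve and footprint function are placeholders, not data from the paper.

```python
def quantum_miss_rate(miss_rate, footprint_at, quantum):
    """Average miss probability of one process over its time quantum.

    miss_rate:    M_A(x), miss rate as a function of cache footprint
    footprint_at: x_A(t), dependent footprint at time t into the quantum
    quantum:      T_A, quantum length in memory references
    """
    total = sum(miss_rate(footprint_at(t)) for t in range(quantum))
    return total / quantum

if __name__ == "__main__":
    # Hypothetical inputs: a simple miss-rate curve and a footprint that grows
    # linearly until it is capped by the space available to the process.
    curve = lambda x: 1.0 / (1.0 + x / 64.0)
    footprint = lambda t: min(0.05 * t, 6000.0)
    print(quantum_miss_rate(curve, footprint, quantum=10000))
```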

  14. Model Summary
     - Cache snapshot (footprint recurrence): E[x_A^Φ(t+1)] = E[x_A^Φ(t)] + M_A(x_A^Φ(t))
     - Per-process miss-rate: mr_A = (1/T_A) ∫[0, T_A] M_A(x_A(t)) dt
     - Overall miss-rate: sum over the N processes, weighted by quantum length:
       OMR = Σ(i=1..N) T_i · mr_i / Σ(i=1..N) T_i
     [Flow diagram, per process: the miss-rate curve M(x) and the schedule give the independent footprint x^Φ(t), then the dependent footprint x(t), then the miss-rate mr; the per-process miss-rates combine into the overall miss-rate (OMR)]
     (An end-to-end sketch of this last step follows below.)
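A minimal sketch of the final step above: combining per-process miss-rates into the overall miss-rate as a quantum-length-weighted average. The numbers are illustrative, not measurements from the paper.

```python
def overall_miss_rate(miss_rates, quanta):
    """OMR as the quantum-length-weighted average of per-process miss-rates."""
    total_refs = sum(quanta.values())
    return sum(quanta[p] * miss_rates[p] for p in miss_rates) / total_refs

if __name__ == "__main__":
    miss_rates = {"vpr": 0.036, "vortex": 0.041}   # hypothetical mr_p values
    quanta = {"vpr": 20000, "vortex": 20000}        # references per quantum
    print(overall_miss_rate(miss_rates, quanta))
```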

  15. Model vs. Simulation: 2 Processes
     [Plot: miss-rate (0.03 to 0.044) vs. time quantum (0 to 100,000) for vpr+vortex on a 32-KB cache, comparing simulation and model curves]

  16. Model vs. Simulation: 4 Processes
     [Plot: miss-rate (0.04 to 0.07) vs. time quantum (0 to 100,000) for vpr+vortex+gcc+bzip2 on a 32-KB cache, comparing simulation and model curves]

  17. Cache Partitioning
     - Time-sharing degrades cache performance significantly for some time quanta
       - Due to dumb allocation by the LRU policy
       - Could be improved by explicit cache partitioning
     - Specifying a partition
       - Dedicated area (D_A): cache blocks that only Process A can use
       - Shared area (S): cache blocks that any process can use while it is active

  18. Strategy
     - Off-line profiling of MR(size) curves
       - One per phase
       - Independent of other processes
       - Can also be obtained on-line with HW support
     - On-line partitioning
       - Partitioning decision based on the model
       - Modify the LRU policy to partition the cache

  19. Optimal Cache Partition
     - Dedicated areas (D_A) specify the initial amount of data for each process: x_A(0) = D_A
     - Shared (S) and dedicated (D_A) areas specify the maximum cache space for each process: C_A = D_A + S
     - The model can estimate the miss-rate for a given partition
     - Use a gradient-based search algorithm (sketched below)
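A greedy stand-in for the gradient-based search mentioned above: repeatedly give one unit of dedicated space to the process whose addition lowers the model-estimated overall miss-rate the most. `estimate_omr` is a placeholder for the analytical model, and this allocation strategy is my simplification, not necessarily the authors' exact algorithm.

```python
def search_partition(estimate_omr, processes, total_dedicated, unit=1):
    """Greedy search over dedicated-area sizes D_p (shared area held fixed)."""
    partition = {p: 0 for p in processes}
    for _ in range(0, total_dedicated, unit):
        best_p, best_omr = None, float("inf")
        for p in processes:
            trial = dict(partition)
            trial[p] += unit
            omr = estimate_omr(trial)  # model-predicted overall miss-rate
            if omr < best_omr:
                best_p, best_omr = p, omr
        partition[best_p] += unit
    return partition

if __name__ == "__main__":
    # Hypothetical stand-in for the model: diminishing returns per process.
    def fake_model(part):
        return sum(1.0 / (1.0 + part[p] * w)
                   for p, w in (("vpr", 0.2), ("vortex", 0.1))) / 2
    print(search_partition(fake_model, ["vpr", "vortex"], total_dedicated=16))
```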

  20. Simulation Results: Fully-Associative Caches
     - 25% miss-rate improvement in the best case
     - 7% improvement for short time quanta
     [Plot: miss-rate vs. time quantum (1 to 1,000,000) for a 32-KB fully-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, comparing LRU and the partitioned cache]

  21. From Fully-Associative to Set-Associative
     - Use the fully-associative model and curves to determine D_A and S
     - Modify the LRU replacement policy to partition
       - Count the number of cache blocks held by each process (X_A)
       - Try to match X_A to the allocated cache space
     - Replacement decision when Process A is active (sketched below):
       - Replace Process A's LRU block if X_A ≥ D_A + S
       - Replace Process B's LRU block if X_B ≥ D_B
       - Replace the standard LRU block if there is no over-allocated process
     - Add a small victim cache (16 entries)
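A minimal sketch of the modified replacement decision listed above; `lru_victim` and `std_lru_victim` are hypothetical hooks into per-process and global LRU bookkeeping that is not shown here.

```python
def choose_victim(active, blocks_held, dedicated, shared, lru_victim, std_lru_victim):
    """Pick the block to replace when the active process misses.

    blocks_held: X_p, number of cache blocks currently held by process p
    dedicated:   D_p, dedicated allocation of process p
    shared:      S, size of the shared area
    """
    # Active process over its total allocation: evict its own LRU block.
    if blocks_held[active] >= dedicated[active] + shared:
        return lru_victim(active)
    # Any dormant process over its dedicated allocation: evict its LRU block.
    for p in blocks_held:
        if p != active and blocks_held[p] >= dedicated[p]:
            return lru_victim(p)
    # Otherwise fall back to the standard (global) LRU block.
    return std_lru_victim()

if __name__ == "__main__":
    held = {"A": 5, "B": 9}
    ded = {"A": 6, "B": 8}
    print(choose_victim("A", held, ded, shared=4,
                        lru_victim=lambda p: f"LRU block of {p}",
                        std_lru_victim=lambda: "global LRU block"))
```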

  22. Simulation Results: Set-Associative Caches
     - 15% miss-rate improvement in the best case
     - 4% improvement for short time quanta
     [Plot: miss-rate vs. time quantum (1 to 1,000,000) for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu, comparing LRU and the partitioned cache]

  23. Summary
     - Analytical cache model
       - Very accurate, yet tractable
       - Works for any cache size and time quantum
       - Applicable to set-associative cache partitioning
     - Applications
       - Dynamic cache partitioning with on-line/off-line approximations of miss-rate curves
       - Various scheduling problems
