A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning
G. Edward Suh, Srinivas Devadas, Larry Rudolph
Massachusetts Institute of Technology
HPCA-8, February 5, 2002
Problem
• Memory system performance is critical
• Everyone thinks about their own application
– Tuning replacement policies
– Software/hardware prefetching
• But modern computer systems execute multiple applications concurrently/simultaneously
– Time-shared systems
• Context switches cause cold misses
– Multiprocessor systems sharing the memory hierarchy (SMP, SMT, CMP)
• Simultaneous applications compete for cache space
Solutions: Cache Partitioning & Memory-Aware Scheduling
• Cache Partitioning
– Explicitly manage cache space allocation amongst concurrent/simultaneous processes
• Each process gets a different benefit from more cache space
• Similar to main memory partitioning in the old days (e.g., Stone 1992)
• Memory-Aware Scheduling
– Choose a set of simultaneous processes to minimize memory/cache contention
– Scheduling for SMT systems (Snavely 2000)
• Threads interact in various ways (RUU, functional units, caches, etc.)
• Based on executing various schedules and profiling them
– Admission control for gang scheduling (Batat 2000)
• Based on the footprint of a job (total memory usage)
BUT…
• Testing many possible schedules → not viable
– The number of possible schedules increases exponentially with the number of processes
– Need to decide on a good schedule from individual process characteristics → complexity increases only linearly
• Footprint-based scheduling → not enough information
– The footprint of a process is often larger than the cache
– Processes may not need their entire working set in the cache
• Can we find a good schedule for cache performance?
– What information do we need for each process?
Information a Scheduler/Partitioner Needs
• Characterizing a process
– For scheduling and partitioning, we need to know the effect of varying the cache size
• Multiple performance numbers for different cache sizes
• Ignore effects other than cache size
• Miss-rate curves: m(c)
– Cache miss-rate as a function of cache size c (in cache blocks)
• Assume the process runs in isolation
• Assume the cache is FULLY-ASSOCIATIVE
– Provides essential information for scheduling and partitioning
[Figure: miss-rate curve, miss-rate vs. cache space (%)]
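To make the definition concrete, here is a minimal offline sketch (not the paper's monitoring hardware) that computes m(c) for every fully-associative LRU cache size in a single pass over an address trace, using LRU stack distances; the function name and the toy trace are made up for illustration.

```python
# A hit at LRU stack distance d is a hit in any fully-associative LRU cache
# of more than d blocks; first touches are always misses. One pass over the
# trace therefore yields m(c) for all sizes c at once.
from collections import Counter

def miss_rate_curve(trace, max_blocks):
    stack = []                      # LRU stack: most-recently-used block first
    dist_hist = Counter()           # histogram of stack distances of hits
    for block in trace:
        if block in stack:
            d = stack.index(block)  # 0 = MRU
            dist_hist[d] += 1
            stack.remove(block)
        stack.insert(0, block)

    refs = len(trace)
    curve, hits = [], 0
    for c in range(1, max_blocks + 1):
        hits += dist_hist[c - 1]    # a cache of c blocks catches distances < c
        curve.append((refs - hits) / refs)   # m(c)
    return curve

# Toy trace of block addresses
print(miss_rate_curve([1, 2, 3, 1, 2, 4, 1, 2], max_blocks=4))  # [1.0, 1.0, 0.5, 0.5]
```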
Using Miss-Rate Curves for Partitioning
• What do miss-rate curves tell us about cache allocation?
– Given the miss-rate curves of processes A and B and their numbers of references ref_A and ref_B, an allocation of c_A blocks to A and c_B blocks to B causes
Cache misses = m_A(c_A)·ref_A + m_B(c_B)·ref_B
[Figure: miss-rate curves of process A and process B vs. cache space (%), and the resulting cache allocation c_A / c_B]
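A tiny numerical illustration of the expression above, with made-up miss rates and reference counts; it only shows how two candidate allocations of the same total cache space compare.

```python
# Hypothetical numbers, purely to illustrate the combined-miss expression.
m_A = {4: 0.30, 8: 0.10}      # assumed miss rates for process A, by block count
m_B = {4: 0.25, 8: 0.05}      # assumed miss rates for process B
ref_A, ref_B = 1_000_000, 2_000_000

def total_misses(c_A, c_B):
    return m_A[c_A] * ref_A + m_B[c_B] * ref_B

# Splitting a 12-block cache as 4/8 vs. 8/4:
print(total_misses(4, 8))   # 0.30*1e6 + 0.05*2e6 = 400,000 misses
print(total_misses(8, 4))   # 0.10*1e6 + 0.25*2e6 = 600,000 misses
```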
Finding the best allocation
• Use the marginal gain: g(c) = m(c)·ref - m(c+1)·ref
– The reduction in the number of misses from giving the process one more cache block
• Allocate cache blocks to the processes one at a time, in a greedy manner
– Guaranteed to result in the optimal partition if the m(c) are convex
• Example (marginal gains in hits): process A: 987, 409, …; process B: 2111, 1568, 746, …
– Initially no cache block is allocated
– Compare marginal gains: 987 < 2111 → allocate a block to B
– Compare 987 < 1568 → allocate a block to B
– Compare 987 > 746 → allocate a block to A
– Compare 409 < 746 → allocate a block to B
– Resulting allocation: 1 block to A, 3 blocks to B (A B B B)
[Figure: marginal gain (hits) vs. cache space (blocks) for processes A and B, with the resulting cache allocation]
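The walk-through above corresponds to the following minimal sketch of the greedy allocator; the gain values beyond those used in the comparisons are assumed, and the function name is made up.

```python
# Greedy partitioning: repeatedly give the next cache block to the process
# with the largest marginal gain at its current allocation.
def greedy_partition(gains, total_blocks):
    """gains[p] is the list g_p(0), g_p(1), ... for process p."""
    alloc = {p: 0 for p in gains}
    for _ in range(total_blocks):
        best = max(gains, key=lambda p: gains[p][alloc[p]])
        alloc[best] += 1
    return alloc

gains = {
    "A": [987, 409, 282, 104],    # marginal gains for process A (tail values assumed)
    "B": [2111, 1568, 746, 250],  # marginal gains for process B (tail values assumed)
}
print(greedy_partition(gains, 4))   # {'A': 1, 'B': 3}, matching the A B B B example
```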
Partitioning Results
• Partition the L2 cache amongst two simultaneous processes (SPEC2000 benchmarks: art and mcf)
[Figure: IPC of standard LRU vs. the partitioned cache for L2 sizes of 0.25, 0.5, 1, 2, and 4 MB]
Intuition for Memory-Aware Scheduling
• How do we schedule 4 processes on a 2-processor system using individual miss-rate curves?
– The working set size is larger than the cache for all processes
– All processes end up with similar miss-rates if each is given the entire cache
• Curves tend to have a knee
– The amount of cache space beyond which the marginal gain diminishes sharply
• Group processes based on their knees
– Schedule A and C together, and B and D together
[Figure: miss-rate curves of processes A, B, C, and D vs. cache space (%)]
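One possible reading of this intuition, as a sketch: pair the process with the smallest knee with the one with the largest knee, as long as the two knees fit in the cache together. The knee sizes, cache size, and pairing rule below are assumptions for illustration, not the paper's exact algorithm.

```python
knees = {"A": 200, "B": 400, "C": 800, "D": 600}   # hypothetical knee sizes (blocks)
cache_blocks = 1024

def pair_by_knees(knees, capacity):
    """Greedily pair the smallest-knee process with the largest-knee one."""
    order = sorted(knees, key=knees.get)            # processes by increasing knee
    pairs = []
    while len(order) >= 2:
        small, large = order.pop(0), order.pop()
        pairs.append((small, large, knees[small] + knees[large] <= capacity))
    return pairs

for p, q, fits in pair_by_knees(knees, cache_blocks):
    print(f"schedule {p} with {q}: combined knees fit in the cache = {fits}")
# With these assumed knees: A with C, and B with D, both fitting the cache.
```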
Determining the Knee of the Curve
• Use the partitioning technique
– Partition the cache amongst all the processes based on marginal gains; each process's resulting share marks its knee
[Figure: marginal gains (hits) vs. cache space (blocks) for processes A and B, with the resulting cache allocation]
• However, we may now need multiple time slices to schedule the processes (2 time slices in our example)
– The available cache resource should therefore be doubled
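A compact sketch of this step under the same greedy scheme: hand out the cache capacity of all time slices block by block and read off each process's share as its knee. All gain values below are assumed numbers, not measurements from the paper.

```python
# With N processes on P processors we need ceil(N/P) time slices, so that
# many copies of the cache are available to hand out during partitioning.
def find_knees(gains, cache_blocks, time_slices):
    alloc = {p: 0 for p in gains}
    for _ in range(cache_blocks * time_slices):
        best = max(gains, key=lambda p: gains[p][alloc[p]])
        alloc[best] += 1
    return alloc

gains = {
    "A": [900, 500, 100, 50, 10, 5, 2, 1],    # assumed marginal gains per block
    "B": [800, 700, 600, 80, 20, 10, 5, 2],
    "C": [950, 300, 60, 40, 15, 8, 4, 2],
    "D": [850, 640, 590, 70, 30, 12, 6, 3],
}
# 4 processes on 2 processors -> 2 time slices of a 4-block cache.
print(find_knees(gains, cache_blocks=4, time_slices=2))  # {'A': 1, 'B': 3, 'C': 1, 'D': 3}
```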
Scheduling Results
• Schedule 6 SPEC CPU benchmarks on 2 processors
[Figure: normalized miss-rates of the worst schedule, the best schedule, and our scheduling algorithm, for memory sizes of 8, 16, 32, 64, 128, and 256 MB]
Analytical Model (ICS'01)
• Miss-rate curves (or marginal gains) alone may not be enough for optimizing time-shared systems
– Partitioning amongst concurrent processes
– Scheduling considering the effects of context switches
• Use an analytical model to predict cache-sharing effects
[Figure: miss-rate vs. time quantum (number of cache accesses) for LRU and partitioned 32-KB 8-way set-associative caches running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu]
BUT…
• The processes to execute are only known at run-time
– Users decide what applications to run
– Scheduling/partitioning decisions must be made at run-time
• The behavior of a process changes over time
– Applications have different phases
– Miss-rate curves (and marginal gains) may change over an execution
• Cache configurations differ across systems
– Miss-rate curves (and marginal gains) are different for different systems
• Need on-line estimation of miss-rate curves (and marginal gains)
On-Line Estimation of Marginal Gains: Fully-Associative Caches
• Marginal gains can be counted directly from the temporal ordering of cache blocks (LRU information)
– Use one counter per cache block (or per group of cache blocks), plus one counter for all accesses
– Hit on the i-th MRU block → increment the i-th counter
• Example: a fully-associative cache with 4 blocks
– A hit on the MRU block increments the 1st counter; a hit on the 3rd MRU block increments the 3rd counter; every access increments the access counter
[Figure: cache blocks in LRU order with per-position marginal-gain counters (913, 722, 351, 124) and an access counter (2433 → 2434)]
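A software sketch of the counting scheme just described (the paper proposes hardware counters); the class name and toy trace are made up.

```python
# One counter per LRU position plus one counter for all accesses.
# hit_counters[i] is the number of hits on the (i+1)-th MRU block, i.e. the
# marginal gain g(i) of growing the cache from i to i+1 blocks.
class MarginalGainMonitor:
    def __init__(self, num_blocks):
        self.stack = []                       # LRU stack, MRU at index 0
        self.hit_counters = [0] * num_blocks
        self.accesses = 0
        self.num_blocks = num_blocks

    def access(self, block):
        self.accesses += 1
        if block in self.stack:
            i = self.stack.index(block)       # 0 = MRU
            self.hit_counters[i] += 1         # hit on the (i+1)-th MRU block
            self.stack.remove(block)
        elif len(self.stack) == self.num_blocks:
            self.stack.pop()                  # evict the LRU block
        self.stack.insert(0, block)           # accessed block becomes MRU

mon = MarginalGainMonitor(num_blocks=4)
for b in [1, 2, 3, 1, 2, 4, 1, 2]:            # toy block trace
    mon.access(b)
print(mon.hit_counters, mon.accesses)         # [0, 0, 4, 0] 8
```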
BUT…
• Most caches are SET-ASSOCIATIVE
– Except main memory
– Usually up to 8-way associative
• Set-associative caches only maintain temporal ordering within a set
– No global temporal ordering
• Cannot use block-by-block temporal ordering to obtain marginal gains for fully-associative caches
Way-Counters
• Way-Counters
– Use the existing LRU information within a set
– One counter per way (a D-way cache → D counters)
– Hit on the i-th MRU block of a set → increment the i-th counter
• Each way-counter represents the gain of having S more blocks (S is the number of sets)
• Example: 4-way associative cache with S sets
– A hit on the MRU block of a set increments the 1st counter; a hit on the 2nd MRU block of a set increments the 2nd counter
[Figure: way counters for a 4-way set-associative cache, and the miss-rate curve estimated from way-counters vs. the fully-associative miss-rate as a function of cache size (blocks)]
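A corresponding software sketch of way-counters, assuming per-set LRU state and a simple block-to-set mapping; the class name, block size, and toy addresses are made up.

```python
# A hit on the i-th MRU way of a set increments way counter i; counter i then
# approximates the gain of having S more blocks, where S is the number of sets.
class WayCounterMonitor:
    def __init__(self, num_sets, ways, block_bytes=64):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        self.sets = [[] for _ in range(num_sets)]     # per-set LRU stacks, MRU first
        self.way_counters = [0] * ways
        self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.block_bytes
        s = self.sets[block % self.num_sets]          # index the set
        if block in s:
            i = s.index(block)                        # within-set LRU position
            self.way_counters[i] += 1                 # hit on the (i+1)-th MRU way
            s.remove(block)
        elif len(s) == self.ways:
            s.pop()                                   # evict the set's LRU block
        s.insert(0, block)

mon = WayCounterMonitor(num_sets=2, ways=4)
for a in [0, 64, 128, 0, 64, 256, 0, 64]:             # toy byte addresses
    mon.access(a)
print(mon.way_counters, mon.accesses)                 # [2, 2, 0, 0] 8
```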
Way+Set Counters
• Use more counters for more detailed information
– Maintain the LRU information of (groups of) sets
– Hit on the i-th MRU way and the j-th MRU set group → increment counter(i, j)
• Example: 2-way associative cache whose S sets are divided into S' groups; a hit on the MRU way of a set in the 2nd MRU group increments counter(0, 1)
[Figure: miss-rate curves estimated with way-counters (2-way), way+set counters (8 groups), and way+set counters (16 groups), compared to a fully-associative cache, as a function of cache size (blocks)]
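A software sketch of the way+set idea; how the sets are grouped and how the group LRU order is updated are assumptions made here for illustration, since the slide does not spell them out.

```python
# On top of per-set LRU, keep an LRU ordering of set groups and index a 2-D
# counter by (way LRU position, set-group LRU position).
class WaySetCounterMonitor:
    def __init__(self, num_sets, ways, num_groups, block_bytes=64):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        self.sets = [[] for _ in range(num_sets)]            # per-set LRU stacks
        self.group_of = [s * num_groups // num_sets for s in range(num_sets)]
        self.group_lru = list(range(num_groups))             # MRU group first
        self.counters = [[0] * num_groups for _ in range(ways)]

    def access(self, addr):
        block = addr // self.block_bytes
        set_idx = block % self.num_sets
        g = self.group_of[set_idx]
        j = self.group_lru.index(g)                          # group's LRU position
        s = self.sets[set_idx]
        if block in s:
            i = s.index(block)                               # way's LRU position
            self.counters[i][j] += 1                         # increment counter(i, j)
            s.remove(block)
        elif len(s) == self.ways:
            s.pop()
        s.insert(0, block)
        self.group_lru.remove(g)                             # accessed group becomes MRU
        self.group_lru.insert(0, g)

mon = WaySetCounterMonitor(num_sets=4, ways=2, num_groups=2)
for a in [0, 64, 128, 192, 0, 64]:                           # toy byte addresses
    mon.access(a)
print(mon.counters)                                          # [[1, 1], [0, 0]]
```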