An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures
Konstantinos Nikas, Computing Systems Laboratory, National Technical University of Athens, Greece
Matthew Horsnell, Jim Garside, Advanced Processor Technologies Group, University of Manchester
Introduction • Cores in CMPs typically share some level of the memory hierarchy • Applications compete for the limited shared space • Need for efficient use of the shared cache – Requests to off-chip memory are expensive (latency and power)
Introduction • LRU (or approximations of it) is typically employed • Partitions the cache implicitly on a demand basis – The application with the highest demand gets the majority of the cache resources – Could be suboptimal (e.g. streaming applications) • Thread-blind policy – Cannot detect and deal with inter-thread interference
Motivation • Applications can be classified into 3 different categories [Qureshi and Patt (MICRO ’06)] • High Utility • Applications that continue to benefit significantly as the cache space is increased
Motivation • Low Utility • Applications that do not benefit significantly as the cache space is gradually increased
Motivation • Saturating Utility • Applications that initially benefit as the cache space is increased, but whose benefit saturates beyond a certain allocation Target : Exploit the differences in the cache utility of concurrently executed applications
Static Cache Partitioning
Static Cache Partitioning • Two major drawbacks • The system must be aware of each application’s profile • Partitions remain the same throughout the execution – Programs are known to have distinct phases of behaviour • Need for a scheme that can partition the cache dynamically – Acquire the applications’ profile at run-time – Repartition when the phase of an application changes
Dynamic Cache Partitioning • LRU's “stack property” [Mattson et al. 1970] – “An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to hit also if the cache had more than N ways, provided that the number of sets remains the same.” – This property allows the hits an application would score with more (or fewer) ways to be estimated at run-time from a single profile
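A minimal sketch (an illustration of the classic technique, not code from the paper) of how the stack property is typically exploited: recording the LRU stack distance of every access to a set yields one histogram from which the hit count for any associativity can be read off.

```python
from collections import Counter

def stack_distance_profile(accesses):
    """Histogram: LRU stack distance -> number of accesses at that distance."""
    lru_stack = []          # index 0 holds the most-recently-used tag
    histogram = Counter()
    for tag in accesses:
        if tag in lru_stack:
            histogram[lru_stack.index(tag) + 1] += 1   # 1-based distance
            lru_stack.remove(tag)
        lru_stack.insert(0, tag)                       # promote to MRU
    return histogram

def hits_with_ways(histogram, ways):
    """Stack property: an access hits with N ways iff its distance <= N."""
    return sum(c for dist, c in histogram.items() if dist <= ways)

hist = stack_distance_profile(["A", "B", "A", "C", "B", "A"])
print(hits_with_ways(hist, 2))   # 1 hit with 2 ways
print(hits_with_ways(hist, 4))   # 3 hits -- never fewer as ways grow
```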
ABFCP : Overview • Adaptive Bloom Filter Cache Partitioning (ABFCP) • A Partitioning Module per core – Tracks misses and hits – Runs the partitioning algorithm – Provides replacement support to enforce the partitions • [Diagram: each core with its private I-Cache, D-Cache and Partitioning Module, connected to the shared L2 cache and to DRAM]
ABFCP : Tracking system • Far Misses – Misses that would have been hits had the application been allowed to use more cache ways – Tracked by Bloom filters
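A hedged sketch of the far-miss tracking idea (the filter size and hash functions below are assumptions for illustration, not the paper's hardware parameters): when a line leaves a core's partition, its tag is inserted into that core's Bloom filter; a later miss whose tag queries positive is counted as a far miss.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; size and hash count are illustrative assumptions."""
    def __init__(self, bits=1024, hashes=2):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, tag):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{tag}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, tag):
        for pos in self._positions(tag):
            self.array[pos] = True

    def may_contain(self, tag):          # may report false positives
        return all(self.array[pos] for pos in self._positions(tag))

# On eviction from this core's partition, remember the line's tag
far_misses = BloomFilter()
far_misses.insert(0x4A2)

# On a later miss: a positive query counts as a far miss
c_farmiss = 1 if far_misses.may_contain(0x4A2) else 0
```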
ABFCP : Partitioning Algorithm • 2 counters per core per cache set – C_LRU – C_FarMiss • Each core’s allocation can be changed by ±1 way • Estimate performance loss/gain – −1 way : hits in the LRU position will become misses → perf. loss = C_LRU – +1 way : a portion of the far misses will become hits → perf. gain = a × C_FarMiss, where a = (1 − ways/assoc)
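The two estimates above translate directly into a couple of lines of code; this sketch just restates the slide's formulas (the surrounding bookkeeping is assumed):

```python
def estimated_loss(c_lru):
    """Taking one way away: hits in the LRU position become misses."""
    return c_lru

def estimated_gain(c_farmiss, ways, assoc):
    """Granting one more way: a fraction a of the far misses become hits."""
    a = 1 - ways / assoc
    return a * c_farmiss

# Example: a core holding 8 of 32 ways that recorded 40 far misses
print(estimated_gain(c_farmiss=40, ways=8, assoc=32))   # 30.0 expected extra hits
```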
ABFCP : Partitioning Algorithm • Select the best partition that maximises performance (hits) • Complexity – cores = 2 → possible partitions = 3 – cores = 4 → possible partitions = 19 – cores = 8 → possible partitions = 1107 – cores = 16 → possible partitions = 5196627 • Linear algorithm that selects the best partition or a good approximation thereof. – N/2 comparisons (worst case) → O(N)
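One way such a linear selection could work, as a heavily hedged software sketch (the paper's exact hardware algorithm is not reproduced here): greedily pair the cores with the largest estimated gains against the cores with the smallest estimated losses, moving a way whenever the move increases total hits; at most N/2 pairings are attempted.

```python
def repartition(ways, gains, losses, min_ways=1):
    """Greedy one-pass reallocation: each core changes by at most +/-1 way."""
    n = len(ways)
    receivers = sorted(range(n), key=lambda i: gains[i], reverse=True)
    donors = sorted(range(n), key=lambda i: losses[i])
    new_ways, moved = list(ways), set()
    for r, d in zip(receivers, donors):
        if r == d or r in moved or d in moved:
            continue
        if gains[r] > losses[d] and new_ways[d] > min_ways:
            new_ways[r] += 1        # core r gains the way it benefits most from
            new_ways[d] -= 1        # core d gives up the way it misses least
            moved.update((r, d))
    return new_ways

print(repartition(ways=[16, 16], gains=[30, 5], losses=[20, 10]))   # [17, 15]
```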
ABFCP : Way Partitioning • Way Partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06] • Each line has a core-id field • On a miss, the ways in the set occupied by the miss-causing application are counted – ways_occupied < partition_limit → the victim is the LRU line of another application – Otherwise the victim is the miss-causing application’s own LRU line
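The victim-selection rule stated above, as a minimal sketch (the data layout is an assumption: each line carries its owner's core-id and the set is ordered MRU to LRU):

```python
def select_victim(set_owners, core_id, partition_limit):
    """set_owners: core-id per line, index 0 = MRU, last index = LRU."""
    occupied = sum(1 for owner in set_owners if owner == core_id)
    if occupied < partition_limit:
        # Below its limit: evict the LRU line belonging to another core
        for idx in range(len(set_owners) - 1, -1, -1):
            if set_owners[idx] != core_id:
                return idx
    # At its limit (or no other core's line present): evict its own LRU line
    for idx in range(len(set_owners) - 1, -1, -1):
        if set_owners[idx] == core_id:
            return idx

# Core 0 holds 2 of 4 ways with a limit of 3: core 1's LRU line (index 3) goes
print(select_victim([0, 1, 0, 1], core_id=0, partition_limit=3))   # 3
```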
Evaluation • Configuration – 2,4,8 single-issue, in-order cores – Private L1 I and D caches (32KB, 4-way associative, 32B line size, 1 cycle access latency) – Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16 cycle access latency) – Main memory (32 outstanding requests, 100 cycle access latency) • Benchmarks – 9 apps from JavaGrande + NAS – One application per processor – Simulation stops when one of the benchmarks finishes
Results (Dual core system)
Results (Dual core system)
Results (Quad core system)
Results (Eight core system)
Evaluation • Increasing promise as the number of cores increases • Hardware cost per core – BF arrays (4096 sets × 32b) → 16KB – Counters (4096 sets × 2 counters × 8b) → 8KB – L2 cache (240KB tags + 4MB data) → 4336KB – 24KB per core → 0.55% increase in area • 8-core system – 48KB for the per-line core-id fields – Total overhead 240KB → 5.5% increase over L2
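To make the slide's arithmetic easy to check, a quick calculation using only the sizes quoted in the bullets above:

```python
KB = 8192   # bits per kilobyte

bf_arrays = 4096 * 32              # Bloom filter arrays: 16 KB per core
counters  = 4096 * 2 * 8           # C_LRU + C_FarMiss:    8 KB per core
per_core  = bf_arrays + counters   # 24 KB per core

l2_bits  = (240 + 4 * 1024) * KB               # 240 KB tags + 4 MB data
lines    = 4 * 1024 * KB // (32 * 8)           # 4 MB of 32-byte lines
core_ids = lines * 3                           # 3-bit core-id for 8 cores

print(per_core / KB, per_core / l2_bits)       # 24.0 KB, ~0.55% of L2
print((8 * per_core + core_ids) / KB)          # 240.0 KB total for 8 cores
```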
Evaluation
Related Work • Cache Partitioning Aware Replacement Policy (CPARP) [Dybdahl et al. HiPC ’06] – Cannot deal with applications with non-convex miss-rate curves • Utility-Based Cache Partitioning (UCP) [Qureshi and Patt MICRO ’06] – Smaller overhead – Enforces the same partition over all the cache sets
Conclusions • It is important to share the cache efficiently in CMPs • LRU does not achieve optimal sharing of the cache • Cache partitioning can alleviate its consequences • ABFCP – shows increasing promise as the number of cores increases – provides better performance than LRU at a reasonable cost (a 5.5% area increase for an 8-core system achieves results similar to using LRU with a 50% bigger L2 cache)
Any Questions? Thank you!
Utility-Based Cache Partitioning
Utility-Based Cache Partitioning • High hardware overhead, reduced by: – Dynamic Set Sampling (monitoring only 32 sets) → smaller UMONs – Enforcing the same partition for the whole cache → fewer counters
Utility-Based Cache Partitioning
ABFCP Comparison with UCP • UCP has a lower storage overhead (70KB for an 8-core system) • If UCP attempted to partition on a per-line basis, it would require 11MB per processor • ABFCP is more robust • ABFCP performs better as the number of cores increases
ABFCP Comparison with UCP
CPARP
Conclusions
Evaluation • UCP acquires a more accurate profile than CPARP • Example – curr_hits = 135 – If app2 gets 6 ways, then hits = 145, so UCP repartitions – CPARP does not modify the partition