An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures

  1. An Adaptive Bloom Filter Cache Partitioning Scheme for Multicore Architectures • Konstantinos Nikas, Computing Systems Laboratory, NTU Athens, Greece • Matthew Horsnell, Jim Garside, Advanced Processor Technologies Group, University of Manchester

  2. Introduction • Cores in CMPs typically share some level of the memory hierarchy • Applications compete for the limited shared space • Need for efficient use of the shared cache – Requests to off-chip memory are expensive (latency and power)

  3. Introduction • LRU (or an approximation of it) is typically employed • Partitions the cache implicitly on a demand basis – The application with the highest demand gets the majority of the cache resources – Could be suboptimal (e.g. streaming applications) • Thread-blind policy – Cannot detect and deal with inter-thread interference

  4. Motivation • Applications can be classified into 3 different categories [Qureshi and Patt (MICRO ’06)] • High Utility • Applications that continue to benefit significantly as the cache space is increased

  5. Motivation • Low Utility • Applications that do not benefit significantly as the cache space is gradually increased

  6. Motivation • Saturating Utility • Applications that initially benefit as the cache space is increased, then stop benefiting • Target: Exploit the differences in the cache utility of concurrently executed applications

  7. Static Cache Partitioning

  8. Static Cache Partitioning • Two major drawbacks • The system must be aware of each application’s profile • Partitions remain the same throughout the execution – Programs are known to have distinct phases of behaviour • Need for a scheme that can partition the cache dynamically – Acquire the applications’ profile at run-time – Repartition when the phase of an application changes

  9. Dynamic Cache Partitioning • LRU's “stack property” [Mattson et al. 1970] • “An access that hits in an N-way associative cache using the LRU replacement policy is guaranteed to hit also if the cache had more than N ways, provided that the number of sets remains the same.”
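
The stack property can be checked on a toy example. The following is a minimal sketch, assuming a single cache set under true LRU; the function name `hit_flags` and the trace are illustrative and do not come from the slides.

```python
from collections import OrderedDict

def hit_flags(trace, ways):
    """Simulate one cache set with true LRU and return a hit/miss flag per access."""
    stack = OrderedDict()  # keys ordered from LRU (first) to MRU (last)
    flags = []
    for tag in trace:
        if tag in stack:
            flags.append(True)
            stack.move_to_end(tag)         # promote the line to MRU
        else:
            flags.append(False)
            if len(stack) == ways:
                stack.popitem(last=False)  # evict the LRU line
            stack[tag] = True
    return flags

# Hypothetical access trace of line tags that all map to the same set.
trace = ["a", "b", "c", "a", "d", "b", "a", "c"]

for ways in (2, 3):
    small = hit_flags(trace, ways)
    large = hit_flags(trace, ways + 1)
    # Every access that hits with `ways` lines also hits with `ways + 1` lines.
    assert all(l for s, l in zip(small, large) if s)
```

Because hits are inclusive across associativities, a single pass over the accesses is enough to reason about how an application would behave with fewer or more ways.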

  10. ABFCP : Overview • Adaptive Bloom Filter Cache Partitioning (ABFCP) • Partitioning Module – Track misses and hits – Partitioning Algorithm – Replacement support to enforce partitions • [Figure: block diagram with CORE 0 to CORE n, each with an I-Cache and a D-Cache, a Partitioning Module, the Shared L2 Cache and DRAM]

  11. ABFCP : Tracking system • Far Misses – Misses that would have been hits had the application been allowed to use more cache ways – Tracked by Bloom filters
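
One way to picture the tracking step is a small Bloom filter per core and per set that remembers evicted tags. This is only a sketch under stated assumptions: the class `FarMissFilter`, its two hash functions and the `clear` policy are hypothetical, since the slides do not specify them. The 32-bit width follows the "32b" figure given in the hardware cost slide.

```python
class FarMissFilter:
    """Bloom filter remembering tags evicted from one set on behalf of one core."""
    BITS = 32  # filter width per core per set (from the hardware cost slide)

    def __init__(self):
        self.bits = 0

    def _positions(self, tag):
        # Two hypothetical hash functions; the real design may use others.
        return (tag % self.BITS, ((tag * 2654435761) >> 16) % self.BITS)

    def record_eviction(self, tag):
        """Remember a tag when one of the core's lines is evicted."""
        for p in self._positions(tag):
            self.bits |= 1 << p

    def is_far_miss(self, tag):
        """On a miss: True if the tag was probably evicted earlier (a far miss)."""
        return all((self.bits >> p) & 1 for p in self._positions(tag))

    def clear(self):
        # Assumption: the filter is reset at some point, e.g. when repartitioning.
        self.bits = 0

f = FarMissFilter()
f.record_eviction(0x1234)
print(f.is_far_miss(0x1234))  # an evicted tag is always reported
```

A Bloom filter never misses a recorded tag but can report false positives, so the far-miss count obtained this way is an upper estimate.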

  12. ABFCP : Partitioning Algorithm • 2 counters per core per cache set – C_LRU – C_FarMiss • Each core’s allocation can be changed by ±1 way • Estimate performance loss/gain – -1 way : Hits in the LRU position will become misses; perf. loss → C_LRU – +1 way : A portion of the far misses will become hits; perf. gain → a * C_FarMiss, where a = (1 - ways/assoc)
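
The two estimates can be written down directly. In this sketch the function names are illustrative, and the comparison in `transfer_is_beneficial` is an assumption about how the estimates might be combined; the slide only defines the loss and gain terms.

```python
def estimate_loss(c_lru):
    """Hits lost if the core gives up one way: its hits in the LRU position."""
    return c_lru

def estimate_gain(c_far_miss, ways, assoc):
    """Hits gained if the core receives one way: a * C_FarMiss, a = 1 - ways/assoc."""
    a = 1 - ways / assoc
    return a * c_far_miss

def transfer_is_beneficial(receiver_far_miss, receiver_ways, donor_lru, assoc):
    # Assumption: moving one way pays off when the receiver's estimated gain
    # exceeds the donor's estimated loss.
    return estimate_gain(receiver_far_miss, receiver_ways, assoc) > estimate_loss(donor_lru)

# Hypothetical counter values for one set of a 32-way cache.
print(estimate_gain(c_far_miss=40, ways=8, assoc=32))  # 0.75 * 40
print(transfer_is_beneficial(40, 8, donor_lru=12, assoc=32))
```

The factor a shrinks as a core already owns more ways, so the same number of far misses counts for less when the core holds most of the set.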

  13. ABFCP : Partitioning Algorithm • Select the best partition that maximises performance (hits) • Complexity – cores = 2 → possible partitions = 3 – cores = 4 → possible partitions = 19 – cores = 8 → possible partitions = 1107 – cores = 16 → possible partitions = 5196627 • Linear algorithm that selects the best partition or a good approximation thereof. – N/2 comparisons (worst case) → O(N)
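
The partition counts quoted above are consistent with the following reading: each core's allocation changes by -1, 0 or +1 way and the total number of ways stays constant. The sketch below counts such combinations in two ways; the function names are illustrative.

```python
from itertools import product
from math import comb

def count_partitions_bruteforce(cores):
    """Enumerate every per-core change in {-1, 0, +1} whose sum is zero."""
    return sum(1 for deltas in product((-1, 0, 1), repeat=cores) if sum(deltas) == 0)

def count_partitions(cores):
    """Closed form: choose k cores that gain a way and k others that lose one."""
    return sum(comb(cores, k) * comb(cores - k, k) for k in range(cores // 2 + 1))

for cores in (2, 4, 8, 16):
    print(cores, count_partitions(cores))
```

The count grows roughly as 3^N, which is why an exhaustive search over candidate partitions becomes impractical and a linear-time selection is attractive.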

  14. ABFCP : Way Partitioning • Way Partitioning support [Suh et al. HPCA ’02, Qureshi and Patt MICRO ’06] • Each line has a core-id field • On a miss the ways occupied by the miss-causing application are counted – ways_occupied < partition_limit → victim is the LRU line of another application – Otherwise the victim is the LRU line of the miss-causing application
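
The victim selection rule can be sketched as follows. The data layout (a list of `(core_id, tag)` pairs ordered from MRU to LRU) and the fallback for an empty candidate list are assumptions made for illustration.

```python
def choose_victim(cache_set, requester, partition_limit):
    """Return the index of the line to evict on a miss by `requester`.

    cache_set: list of (core_id, tag), ordered from MRU (index 0) to LRU (last).
    """
    ways_occupied = sum(1 for core_id, _ in cache_set if core_id == requester)
    if ways_occupied < partition_limit:
        # Below its limit: take a way from another application.
        candidates = [i for i, (c, _) in enumerate(cache_set) if c != requester]
    else:
        # At or above its limit: replace one of its own lines.
        candidates = [i for i, (c, _) in enumerate(cache_set) if c == requester]
    if not candidates:
        # Assumption: fall back to the global LRU line if no candidate exists.
        return len(cache_set) - 1
    return max(candidates)  # highest index = least recently used candidate

# Hypothetical 4-way set shared by cores 0 and 1.
lines = [(0, "a"), (1, "b"), (0, "c"), (1, "d")]
print(choose_victim(lines, requester=0, partition_limit=3))  # evicts a line of core 1
print(choose_victim(lines, requester=0, partition_limit=2))  # evicts a line of core 0
```

With this rule a partition change takes effect gradually: ways migrate between applications only as misses occur.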

  15. Evaluation • Configuration – 2, 4 or 8 single-issue, in-order cores – Private L1 I and D caches (32KB, 4-way associative, 32B line size, 1 cycle access latency) – Unified shared on-chip L2 cache (4MB, 32-way associative, 32B line size, 16 cycle access latency) – Main memory (32 outstanding requests, 100 cycle access latency) • Benchmarks – 9 apps from JavaGrande + NAS – One application per processor – Simulation stops when one of the benchmarks finishes

  16. Results (Dual core system)

  17. Results (Dual core system)

  18. Results (Quad core system)

  19. Results (Eight core system)

  20. Evaluation • Increasing promise as the number of cores increases • Hardware Cost per core – BF arrays (4096 sets * 32b) → 16KB – Counters (4096 sets * 2 counters * 8b) → 8KB – L2 Cache (240KB tags + 4MB data) → 4336KB – 0.55% increase in area • 8-core system – 48KB for the core-ids per cache set – Total overhead 240KB → 5.5% increase over L2
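
The storage figures can be reproduced with a few lines of arithmetic. In this sketch the constant and function names are illustrative, and the 3-bit core-id per cache line is an assumption: it is the reading under which the 48KB figure for an 8-core system works out.

```python
SETS = 4096
LINE_BYTES = 32
L2_DATA_KB = 4 * 1024
L2_TAGS_KB = 240
BITS_PER_KB = 8 * 1024

def per_core_kb():
    """Bloom filter arrays plus the two 8-bit counters per set, for one core."""
    bloom = SETS * 32 / BITS_PER_KB         # 16 KB
    counters = SETS * 2 * 8 / BITS_PER_KB   # 8 KB
    return bloom + counters

def core_id_kb(cores):
    """Core-id storage, assuming ceil(log2(cores)) bits on every L2 line."""
    lines = L2_DATA_KB * 1024 // LINE_BYTES
    bits_per_id = (cores - 1).bit_length()
    return lines * bits_per_id / BITS_PER_KB

def total_overhead_kb(cores):
    return cores * per_core_kb() + core_id_kb(cores)

def percent_of_l2(kb):
    return 100 * kb / (L2_DATA_KB + L2_TAGS_KB)

print(round(percent_of_l2(per_core_kb()), 2))         # per-core increase in percent
print(total_overhead_kb(8))                           # total for 8 cores in KB
print(round(percent_of_l2(total_overhead_kb(8)), 1))  # 8-core increase in percent
```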

  21. Evaluation

  22. Related Work • Cache Partitioning Aware Replacement Policy [Dybdahl et al. HiPC ’06] – Cannot deal with applications with non-convex miss rate curves • Utility-Based cache partitioning [Qureshi and Patt MICRO ’06] – Smaller overhead – Enforces the same partition over all the cache sets

  23. Conclusions • It is important to share the cache efficiently in CMPs • LRU does not achieve optimal sharing of the cache • Cache partitioning can alleviate its consequences • ABFCP – shows increasing promise as the number of cores increases – provides better performance than LRU at a reasonable cost (a 5.5% increase for an 8-core system achieves similar results to using LRU with a 50% bigger L2 cache)

  24. Any Questions? Thank you!

  25. Utility-Based Cache Partitioning

  26. Utility-Based Cache Partitioning • High hardware overhead • Dynamic Set Sampling (monitor only 32 lines) – Smaller UMONs • Enforce the same partition for the whole cache – Fewer counters

  27. Utility-Based Cache Partitioning

  28. ABFCP Comparison with UCP • UCP has a lower storage overhead (70KB for an 8-core system) • If it attempted to partition on a line basis, it would require 11MB per processor • ABFCP is more robust • ABFCP performs better as the number of cores increases

  29. ABFCP Comparison with UCP

  30. CPARP

  31. Conclusions

  32. Evaluation • UCP acquires a more accurate profile than CPARP • Example – curr_hits = 135 – if app2 gets 6 ways then hits = 145 (UCP) – CPARP does not modify the partition

