

  1. Comparison of Cache Replacement Policies using the GEM5 Simulator. Teammates: Bhagyashree, Nivin, Sri Divya.

  2. Performance Bottleneck • The performance gap between the CPU and DRAM has increased drastically. • This leads to a producer-consumer problem: the fast processor (consumer) is starved by slow memory (producer). • How do we fix this?

  3. Ideal Cache What are we looking for? - A cache that is big and fast. - One that exploits temporal & spatial locality well. - Cheap to buy. - Easy to construct.

  4. Big and Fast Memory • We introduced the memory hierarchy to get memory that is both big and fast. • Now a process can have memory the size of an HDD that behaves almost as fast as registers. • For simplicity, let's look at the memory hierarchy as a sequence <1, 2, 3, 4>. • A lower number denotes a faster level.

  5. More about Memory Let's model the problem as producer and consumer: memory (producer) feeding the processor (consumer). - Consider the sequence <1, 2, 3, 4>, in increasing order of memory latency. - Seq1: <1, 2, 3, 4> makes perfect sense. - Seq2: <4, 3, 2, 1> is equivalent to just <4>, right? - Seq3: <1, 3, 2, 4> does this improve performance?

  6. Research Idea 1: Increase the Cache Levels (Level 1, Level 2, Level 3, ... Level n) • The previous discussion showed why the number of cache levels is limited to 3 or 4. • Is cost the only factor stopping us from adding more?

  7. Cache Replacement Policy - When the cache is full, the cache replacement policy (CRP) decides which block is best to discard. - More about locality: - Temporal: "a resource that is referenced at one point in time will be referenced again sometime in the near future" - e.g., a web cache. - Spatial: "the likelihood of referencing a resource is higher if a resource near it was just referenced" - e.g., matrix multiplication (see the loop-order sketch below).
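To make the matrix-multiplication example concrete, here is a minimal sketch (not from the slides) of two classic loop orders for C = A x B over row-major 2-D lists. The i-k-j order walks B along rows (unit stride), the access pattern that spatial locality rewards, while the i-j-k order walks B down columns.

    # Illustrative sketch: two loop orders for C = A x B on row-major 2-D lists.
    def matmul_ijk(A, B, n):
        C = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    C[i][j] += A[i][k] * B[k][j]   # B is accessed column-wise
        return C

    def matmul_ikj(A, B, n):
        C = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for k in range(n):
                a = A[i][k]
                for j in range(n):
                    C[i][j] += a * B[k][j]         # B is accessed row-wise (contiguous)
        return C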

  8. Research Idea 2: Cache Replacement Favors a Specific Locality (Temporal vs. Spatial) • Is it valid to say that LRU is mainly a temporal-locality solver? • Simulation to the rescue. • What else contributes to spatial locality? Think of a larger block size (a toy experiment is sketched below).
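As a quick illustration of the block-size point, here is a toy trace-driven sketch (our own, not the project's simulator): a fully associative LRU cache where only the block size varies. On a sequential scan, larger blocks turn neighbouring addresses into hits, which is exactly the spatial-locality effect alluded to above.

    from collections import OrderedDict

    def hit_rate(addresses, num_blocks, block_size):
        lru = OrderedDict()                  # block tag -> None, ordered by recency
        hits = 0
        for addr in addresses:
            tag = addr // block_size         # which cache block the address falls in
            if tag in lru:
                hits += 1
                lru.move_to_end(tag)         # mark as most recently used
            else:
                if len(lru) >= num_blocks:
                    lru.popitem(last=False)  # evict the least recently used block
                lru[tag] = None
        return hits / len(addresses)

    trace = list(range(4096))                # sequential walk over an array
    for bs in (1, 4, 16, 64):
        print(bs, hit_rate(trace, num_blocks=64, block_size=bs))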

  9. Research Idea 3: Mixing Cache Replacement Algorithms at Different Levels • From the previous idea, we want to mix cache replacement algorithms at different levels and study the performance. • Going back to the sequence <1, 2, 3, 4>, one can argue that a higher-numbered level influences the levels below it. • E.g., take matrix multiplication: let Level 1 favor spatial locality and Level 2 favor temporal locality. • (Diagram: Level 1 - LRU, Level 2 - Random, Level 3, ... Level n; a two-level sketch follows below.)
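A minimal sketch of what "mixing policies per level" could look like in a toy model (our own illustration, assuming fully associative levels; the actual study would use gem5): LRU at Level 1 and random replacement at Level 2, matching the diagram.

    import random
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, num_blocks):
            self.capacity = num_blocks
            self.blocks = OrderedDict()
        def access(self, tag):                       # returns True on hit
            if tag in self.blocks:
                self.blocks.move_to_end(tag)
                return True
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)      # evict the least recently used block
            self.blocks[tag] = None
            return False

    class RandomCache:
        def __init__(self, num_blocks):
            self.capacity = num_blocks
            self.blocks = set()
        def access(self, tag):
            if tag in self.blocks:
                return True
            if len(self.blocks) >= self.capacity:
                self.blocks.discard(random.choice(list(self.blocks)))  # random victim
            self.blocks.add(tag)
            return False

    def run(trace, l1, l2, block_size=4):
        l1_hits = l2_hits = 0
        for addr in trace:
            tag = addr // block_size
            if l1.access(tag):                       # L2 is consulted only on an L1 miss
                l1_hits += 1
            elif l2.access(tag):
                l2_hits += 1
        return l1_hits, l2_hits

    print(run(list(range(512)) * 4, LRUCache(32), RandomCache(256)))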

  10. Cache Associativity: • Fully Associative - the best miss rate (a single set holding all 2^k blocks). • Set Associative - intermediate. • Direct-Mapped - set size of 1. • Larger sets and higher associativity lead to fewer cache conflicts and lower miss rates, but they also increase the hardware cost. (The index calculation is sketched below.)
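A small sketch (our illustration; the cache geometry numbers are made up) of how associativity changes which set an address maps to: with one way per set the block index fully determines the set (direct-mapped), and with as many ways as blocks every address maps to the single set (fully associative).

    def set_index(addr, num_blocks, block_size, associativity):
        """Set that an address maps to for a given associativity."""
        block = addr // block_size
        num_sets = num_blocks // associativity   # fully associative -> 1 set
        return block % num_sets

    addr = 0x1234
    num_blocks, block_size = 64, 64
    for ways in (1, 4, num_blocks):              # direct-mapped, 4-way, fully associative
        print(f"{ways}-way -> set {set_index(addr, num_blocks, block_size, ways)}")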

  11. Research Idea 4: Cache Associativity Intuition: use a larger set size (higher associativity) for the lower, faster cache levels; let's forget the cost for now. But does reversing it give better performance? Coming back to the sequences - cache sequence: <1, 2, 3, 4>; set-size sequence: <N, N-1, N-2, ...>; and then the reverse of it.

  12. Research Idea 5: Let's Combine (Flow Diagram) - Combining: represent all of the above research ideas as a tree (nodes: cache performance, hardware cost, application, complexity, locality, measurement, mix, set associativity, levels, benchmark). - Compare it to OPT and study where each algorithm stands.

  13. Common Cache Replacement Policies - LRU: (diagram)

  14. Common Cache Replacement Policies • LRU: • Expensive in terms of speed and hardware. • Need to remember the order in which all N lines were accessed. • N! orderings - O(log N!) LRU bits. • 2 ways → AB, BA = 2 = 2! • 3 ways → ABC, ACB, BAC, BCA, CAB, CBA = 6 = 3! • Pseudo-LRU: O(N) bits. • Approximates the LRU policy with a binary tree (a sketch follows below).
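A minimal tree-based pseudo-LRU sketch (our own illustration of the binary-tree idea, assuming the number of ways is a power of two): W-1 "direction" bits per set, each pointing toward the subtree that should be victimised next.

    class TreePLRU:
        def __init__(self, ways):
            assert ways & (ways - 1) == 0, "ways must be a power of two"
            self.ways = ways
            self.bits = [0] * (ways - 1)      # node i has children 2*i+1 and 2*i+2

        def touch(self, way):
            # Update the tree after an access to `way` (hit or fill).
            node, lo, hi = 0, 0, self.ways
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if way < mid:                 # accessed way lies in the left subtree,
                    self.bits[node] = 1       # so point the victim search to the right
                    node, hi = 2 * node + 1, mid
                else:                         # ... and vice versa
                    self.bits[node] = 0
                    node, lo = 2 * node + 2, mid

        def victim(self):
            # Follow the direction bits down to the approximate LRU way.
            node, lo, hi = 0, 0, self.ways
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if self.bits[node] == 0:      # 0 -> victim is in the left subtree
                    node, hi = 2 * node + 1, mid
                else:
                    node, lo = 2 * node + 2, mid
            return lo

    plru = TreePLRU(4)
    for w in (0, 1, 2, 3, 0):
        plru.touch(w)
    print(plru.victim())   # prints 2; exact LRU would pick way 1 - hence "approximate"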

  15. Common Cache Replacement Policies - Pseudo-LRU: (diagram)

  16. Common Cache Replacement Policies • MRU: • In contrast to LRU, discards the most recently used items first. • MRU algorithms are most useful in situations where the older an item is, the more likely it is to be accessed, e.g. repeated scans over a data set larger than the cache (a small sketch follows below).
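A tiny MRU sketch (illustrative only, fully associative): on a miss with a full cache, the block touched most recently is evicted. On the cyclic scan below, MRU keeps scoring hits while LRU would score none.

    class MRUCache:
        def __init__(self, num_blocks):
            self.capacity = num_blocks
            self.blocks = []                  # kept in access order, last = MRU
        def access(self, tag):
            if tag in self.blocks:
                self.blocks.remove(tag)
                self.blocks.append(tag)       # move to the MRU position
                return True
            if len(self.blocks) >= self.capacity:
                self.blocks.pop()             # evict the most recently used block
            self.blocks.append(tag)
            return False

    cache = MRUCache(4)
    trace = list(range(5)) * 3                # cyclic scan of 5 blocks through a 4-block cache
    print(sum(cache.access(t) for t in trace), "hits out of", len(trace))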

  17. Common Cache Replacement Policies • Random Replacement: • Simpler, but at the cost of performance. • Round Robin (or FIFO) Replacement: • Replaces the oldest block in the cache using a circular counter. • Each cache set is accompanied by a circular counter that points to the next cache block to be replaced; the counter is updated on every cache miss (see the sketch below).
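A per-set round-robin (FIFO) counter can be sketched like this (our illustration of the mechanism described above; the set geometry is arbitrary):

    class RoundRobinSet:
        def __init__(self, ways):
            self.ways = ways
            self.tags = [None] * ways
            self.counter = 0                                # points at the next victim way
        def access(self, tag):
            if tag in self.tags:
                return True                                 # hit: FIFO state is unchanged
            self.tags[self.counter] = tag                   # miss: fill the victim way
            self.counter = (self.counter + 1) % self.ways   # advance the circular counter
            return False

    s = RoundRobinSet(4)
    for t in (1, 2, 3, 4, 1, 5, 1):
        print(t, "hit" if s.access(t) else "miss")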

  18. Common Cache Replacement Policies • Adaptive Replacement Cache (ARC): • (Diagram: list L1 = T1 + ghost list B1, list L2 = T2 + ghost list B2; T1 and T2 hold cached blocks with |T1| + |T2| = C, while the ghost lists B1 and B2 only remember recently evicted tags and hold no data in memory. A structural sketch follows below.)
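A simplified, illustrative sketch of that structure (our own; it keeps the four lists and the adaptive target p but omits several corner cases of the full ARC algorithm):

    from collections import OrderedDict

    class SimplifiedARC:
        def __init__(self, capacity):
            self.c = capacity
            self.p = 0                                       # target size for T1
            self.t1, self.t2 = OrderedDict(), OrderedDict()  # cached blocks
            self.b1, self.b2 = OrderedDict(), OrderedDict()  # ghost tags only

        def _replace(self):
            # Evict from T1 if it exceeds its target, otherwise from T2, and
            # remember the evicted tag in the matching ghost list (bounded by C).
            if self.t1 and len(self.t1) > self.p:
                tag, _ = self.t1.popitem(last=False)
                self.b1[tag] = None
            elif self.t2:
                tag, _ = self.t2.popitem(last=False)
                self.b2[tag] = None
            while len(self.b1) > self.c:
                self.b1.popitem(last=False)
            while len(self.b2) > self.c:
                self.b2.popitem(last=False)

        def access(self, tag):
            if tag in self.t1:                    # hit in T1: promote to T2 (frequency)
                del self.t1[tag]
                self.t2[tag] = None
                return True
            if tag in self.t2:                    # hit in T2: refresh recency within T2
                self.t2.move_to_end(tag)
                return True
            if tag in self.b1:                    # ghost hit in B1: give recency more room
                self.p = min(self.c, self.p + 1)
                del self.b1[tag]
                self._replace()
                self.t2[tag] = None
                return False
            if tag in self.b2:                    # ghost hit in B2: give frequency more room
                self.p = max(0, self.p - 1)
                del self.b2[tag]
                self._replace()
                self.t2[tag] = None
                return False
            if len(self.t1) + len(self.t2) >= self.c:
                self._replace()                   # brand-new tag and the cache is full
            self.t1[tag] = None
            return False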

  19. Common Cache Replacement Policies • Clock with Adaptive Replacement (CAR): • Combines the advantages of ARC and CLOCK. • It uses four doubly-linked lists: two clocks (T1 & T2) and two simple LRU lists (B1 & B2). • The T1 clock stores pages based on "recency" and T2 stores pages based on "frequency". • B1 & B2 contain pages that have recently been evicted from T1 & T2, respectively.

  20. Simulation and Benchmark • Why do we need a system simulator? • CPU behavior depends on the memory system, and the behavior of the memory system depends on the CPU. • We chose gem5 over other simulators because it makes it much easier to perform different measurements on cache replacement policies. • SPEC CPU benchmarks will be used for performance evaluation.

  21. gem5 Simulator • gem5 = M5 + GEMS • A modular, discrete-event-driven computer system simulation platform. • Rich availability of modules in the framework.

  22. • gem5 has an open-source license, a good object-oriented infrastructure and a very active mailing list. • System modes - 1. System-call Emulation (SE) 2. Full System (FS). • CPU models - 1. Atomic Simple 2. Timing Simple 3. In-order 4. Out-of-order. • Memory system - 1. Classic model (M5) 2. Ruby model (GEMS). • Supported ISAs - ALPHA, ARM, x86, PowerPC, SPARC, MIPS. (A small configuration sketch follows below.)
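The CPU models above correspond to Python classes in gem5's configuration API. A minimal, version-dependent sketch (class names are our assumption; recent gem5 releases use MinorCPU for the in-order model and DerivO3CPU for the out-of-order one):

    # Illustrative fragment of a gem5 Python config script (not a complete, runnable
    # script on its own - ports, clocks and memory still need to be wired up).
    from m5.objects import System, AtomicSimpleCPU, TimingSimpleCPU, MinorCPU, DerivO3CPU

    system = System()
    system.cpu = TimingSimpleCPU()   # swap for DerivO3CPU() to model an out-of-order core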

  23. • Flexibility • Availability • Collaboration • Example - a cache configuration class and two lines from the resulting stats output:

    class L2Cache(Cache):
        size = '256kB'
        assoc = 8
        hit_latency = 20
        response_latency = 20

    system.cpu.apic_clk_domain.clock 16000 # Clock period in ticks
    system.cpu.numCycles 345518 # number of cpu cycles simulated
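Since the project is about swapping replacement policies, a natural extension of the example above is to pick the policy per cache level from the config script. This is a hypothetical sketch assuming a recent gem5 release in which the classic cache model exposes a replacement_policy parameter and ships policies such as TreePLRURP and RandomRP in m5.objects; older releases (as the hit_latency parameter above suggests this one is) select the policy in the C++ tags code instead.

    from m5.objects import Cache, RandomRP, TreePLRURP

    class L1DCache(Cache):
        size = '64kB'                         # illustrative sizes, not from the slides
        assoc = 4
        replacement_policy = TreePLRURP()     # approximate LRU at L1 (temporal locality)

    class L2Cache(Cache):
        size = '256kB'
        assoc = 8
        replacement_policy = RandomRP()       # cheap random policy at L2

This mirrors Research Idea 3: each level of the hierarchy can be given its own policy and the SPEC runs repeated for every combination.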

