Leveraging High-Performance Data Cache Techniques to Save Power in Embedded Systems
Major Bhadauria, Sally A. McKee, Karan Singh, Gary S. Tyson
4/26/2012
  1. Title slide: Leveraging High-Performance Data Cache Techniques to Save Power in Embedded Systems. Major Bhadauria, Sally A. McKee, Karan Singh, Gary S. Tyson.
     Process Technology Leakage Problem: lower operating voltages force lower transistor threshold voltages, which causes an exponential increase in leakage. [Figure: leakage vs. temperature; Ioff (nA/µm, log scale from 1 to 100,000) against temperature (30–110 °C) for 0.25µm, 0.18µm, 0.13µm, and 0.1µm process technologies.]

  2. Outline: cache power reduction solutions; the leakage issue; possible solutions; our Reuse Distance (RD) policy; energy and delay performance; future work.
     Cache Power Reduction:
     - Reduce dynamic power: partition caches horizontally via cache banking or region caches [lee+cases00]; partition caches vertically using filter caches or line buffers [kamble+islped97, kin+ieeetc00].
     - Reduce static power: use high-VT (high-threshold) transistors; dynamically turn off dead lines [kaxiras+isca01]; dynamically put unused lines to sleep [flautner+isca02].
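The vertical-partitioning idea (filter caches) can be illustrated with a small sketch. This is a hypothetical model, not the cited designs: a tiny fully associative buffer is probed before the larger L1, so hits in it avoid the more energy-hungry L1 lookup. The class name, size, and return strings are all illustrative assumptions.

```python
# Hypothetical sketch of vertical partitioning with a filter cache:
# a tiny fully-associative buffer is probed first, so hits there avoid
# the larger (more energy-hungry) L1 lookup. Sizes are illustrative.

from collections import OrderedDict

class FilterCache:
    def __init__(self, lines=8):
        self.lines = lines
        self.buf = OrderedDict()   # tag -> data, maintained in LRU order

    def access(self, tag):
        """Return 'filter-hit' or 'L1-access' for this reference."""
        if tag in self.buf:
            self.buf.move_to_end(tag)      # refresh LRU position
            return "filter-hit"            # cheap: tiny structure probed
        self.buf[tag] = None               # fill from L1 (not modeled here)
        if len(self.buf) > self.lines:
            self.buf.popitem(last=False)   # evict the LRU entry
        return "L1-access"                 # pays full L1 dynamic energy
```

The dynamic-energy win comes from the common case: references with high temporal locality hit in the small buffer and never activate the full L1 arrays.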

  3. Region Caches: partition the data cache into stack, global, and heap regions*; steer accesses to the cache structures using the virtual address*.
     Multiple Access Caches: target way-associative performance without the power overhead.
     - Column-associative caches check a secondary cache line on a miss; an extra bit indicates whether the tag line is hashed.
     - MRU two-way associative caches check the cache ways sequentially rather than in parallel; an extra bit marks the MRU way.
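Steering by virtual address can be sketched as a simple range check. This is a minimal illustration only; the address boundaries below are invented for the example and do not come from the paper, which steers in hardware.

```python
# Hypothetical sketch: steer each access to a stack, global, or heap
# region cache based on its virtual address. The boundary constants are
# illustrative assumptions, not the paper's actual memory layout.

STACK_BASE = 0x7FF0_0000   # assumed bottom of the stack region
GLOBAL_END = 0x1000_0000   # assumed end of static/global data

def steer(vaddr: int) -> str:
    """Classify a virtual address into one of the three region caches."""
    if vaddr >= STACK_BASE:
        return "stack"     # high addresses: stack region cache
    if vaddr < GLOBAL_END:
        return "global"    # low addresses: global/static region cache
    return "heap"          # everything in between: heap region cache
```

Because the check is a comparison on address bits, the steering logic adds essentially no latency or dynamic power to the access path.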

  4. Leakage Reduction:
     - High-VT static solution: replace transistors with high-VT ones; incurs a static increase in latency.
     - Gated-VDD decay caches (state-losing): turn off unused cache lines (loses their data); requires sleep transistors.
     - Adaptive Body Biasing (ABB) and drowsy caches (state-retaining): significant delay and dynamic power consumption between wakeups for ABB; ABB requires a special manufacturing process; drowsy caches use DVS for leakage reduction; both require extra circuitry.
     Previous Drowsy Leakage Policies:
     - Simple: put all cache lines to sleep every X cycles; little overhead, but power/performance is variable.
     - No Access: put a cache line to sleep if it is not accessed within X cycles; requires a counter per cache line.
     - Reuse Most Recently On (RMRO): a No Access policy applied to cache ways; requires a few bits per cache set but only one counter.
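The No Access policy above can be sketched in a few lines. This is a hypothetical behavioral model, not the hardware: each line carries an idle counter, and any line untouched for a full window goes drowsy (low-voltage, state-retaining). The window of 4 intervals is illustrative; the paper evaluates 4K-cycle windows.

```python
# Hypothetical sketch of the "No Access" drowsy policy: each cache line
# has an idle counter; a line not touched within WINDOW intervals is put
# into drowsy (low-voltage, state-retaining) mode.

WINDOW = 4  # illustrative; the paper's experiments use 4K-cycle windows

class DrowsyLine:
    def __init__(self):
        self.idle = 0
        self.drowsy = False

def tick(lines, accessed_index):
    """Advance one interval: wake the accessed line, age all others."""
    for i, line in enumerate(lines):
        if i == accessed_index:
            line.idle = 0
            line.drowsy = False   # in hardware, waking costs extra cycles
        else:
            line.idle += 1
            if line.idle >= WINDOW:
                line.drowsy = True
```

The per-line counter is exactly the overhead the slide calls out: RMRO amortizes it by tracking ways per set instead of individual lines.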

  5. Reuse Distance (RD) Policy:
     - Measures time using cache accesses to increment counters.
     - Keeps only the last N accesses "awake" for an RD of size N, ensuring only N lines are ever awake.
     - Clock-cycle independent; gives an upper bound on the power envelope.
     RD LRU: true LRU is too expensive, so substitute quasi-LRU via saturating counters, or a close approximation via timestamps. [Figure: worked example with N = 4 showing per-line counters, with some bits checked for drowsy misses and other bits incremented on cache accesses.]
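The saturating-counter variant can be sketched as follows. This is a hypothetical model of the idea, not the paper's circuit: every access resets the touched line's counter and ages all others, saturating at N, so at most the N most recently touched lines remain awake regardless of clock speed.

```python
# Hypothetical sketch of the Reuse Distance (RD) policy with saturating
# counters: each access resets the touched line's counter and ages all
# others, saturating at N. Lines whose counter has reached N are drowsy,
# so at most the N most recently touched lines stay awake. Note the
# policy counts accesses, not clock cycles.

N = 4  # RD window for this example; the paper evaluates RD = 15

def access(counters, line):
    """Touch `line`; age the rest; return the indices of awake lines."""
    for i in range(len(counters)):
        if i == line:
            counters[i] = 0           # most recently used: fully awake
        elif counters[i] < N:
            counters[i] += 1          # saturate at N (drowsy threshold)
    return [i for i, c in enumerate(counters) if c < N]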

  6. We Apply:
     - Region caches with the heap cache size reduced by half, plus a multiple access cache to retain performance.
     - A drowsy cache using the RD policy.
     - Target: embedded architecture and applications.
     Experimental Setup:
     - Alpha 21264 architecture/ISA; HotLeakage simulator; 1.5 GHz, 70 nm, 80 °C.
     - SPEC2000 benchmarks using SimPoints.
     - Two-level cache hierarchy: 32KB, 32-byte-line, 4-way L1 D-cache (1 cycle); 4-way unified L2 of 256KB/512KB/1MB/2MB.
     - Drowsy policies: Simple with a 4K-cycle window (No Access omitted); RMRO 256; RD 15.

  7. [Figures: column-associative and MRU cache organizations.]

  8. Reuse Coverage; Performance. [Figure: IPC normalized to the direct-mapped simple configuration (roughly 0.97–0.99) for the CA and MRU caches under the simple and RD policies.]

  9. Dynamic Energy. [Figure: power consumption normalized to a simple direct-mapped cache (roughly 0.2–1.4) for the simple 2-way associative, simple column-associative, and simple MRU configurations.]
     Static Energy. [Figure: leakage normalized to a non-drowsy DM cache (roughly 0.02–0.12) for the heap, stack, and global regions under the simple and RD policies.]

  10. Total Power Consumption. [Figure: total power normalized to a non-drowsy DM cache (roughly 0.05–0.5) for the DM, CA, and MRU caches under the simple and RD policies.]
      Conclusion:
      - Cache power reductions: dynamic power reductions achieved via multiple access caches; significant leakage reduction through the RD policy; minimal performance degradation.
      - Future work: investigate cache interaction in CMP systems; use compiler hints for static cache assignments.

  11. Q&A
