Leveraging High Performance Data Cache Techniques to Save Power in Embedded Systems
Major Bhadauria, Sally A. McKee, Karan Singh, Gary S. Tyson

Process Technology Leakage Problem
- Lower operating voltages and lower transistor thresholds lead to an exponential increase in leakage
[Figure: Leakage vs. Temperature — Ioff (nA/um), log scale, versus temperature (C) for 0.25um, 0.18um, 0.13um, and 0.1um process technologies.]
Outline
- Cache Power Reduction Solutions
- Leakage Issue
- Possible Solutions
- Our Reuse Distance (RD) Policy
- Energy and Delay Performance
- Future Work

Cache Power Reduction
- Reduce dynamic power
  - Partition caches horizontally via cache banking or region caches [lee+cases00]
  - Partition caches vertically using filter caches or line buffers [kamble+islped97, kin+ieeetc00] (a line-buffer sketch follows this slide)
- Reduce static power
  - Use high-VT (high-threshold-voltage) transistors
  - Dynamically turn off dead lines [kaxiras+isca01]
  - Dynamically put unused lines to sleep [flautner+isca02]
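To make the vertical-partitioning idea concrete, below is a minimal sketch (not from the talk) of a line buffer checked before the L1 lookup: a hit in the small buffer avoids driving the larger, more power-hungry L1 array, saving dynamic energy. The structure sizes, field names, and the `l1_read_line` hook are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES   32u            /* assumed L1 line size */
#define LINE_SHIFT   5u             /* log2(LINE_BYTES) */

/* One-entry line buffer ("filter") sitting in front of the L1 data array. */
typedef struct {
    bool     valid;
    uint32_t tag;                   /* line address (addr >> LINE_SHIFT) */
    uint8_t  data[LINE_BYTES];
} line_buffer_t;

/* Hypothetical hook into the full L1 lookup; assumed to exist elsewhere. */
extern void l1_read_line(uint32_t line_addr, uint8_t out[LINE_BYTES]);

/* Return the byte at 'addr', consulting the line buffer first so that a hit
 * never touches the larger L1 array. */
static uint8_t cached_read(line_buffer_t *lb, uint32_t addr)
{
    uint32_t line_addr = addr >> LINE_SHIFT;

    if (!lb->valid || lb->tag != line_addr) {   /* filter miss: go to L1 */
        l1_read_line(line_addr, lb->data);
        lb->tag   = line_addr;
        lb->valid = true;
    }
    return lb->data[addr & (LINE_BYTES - 1u)];
}
```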
Region Caches
- Partition the data cache into stack, global, and heap regions*
- Steer accesses to the cache structures using the virtual address* (a steering sketch follows this slide)

Multiple Access Caches
- Target way-associative performance without the power overhead:
  - Column-associative caches check a secondary cache line on a miss; an extra bit indicates whether the tag line is hashed
  - MRU two-way associative caches check the cache ways sequentially rather than in parallel; an extra bit marks the MRU way
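The following is a minimal sketch of virtual-address steering under assumed address-space boundaries; a real region-cache design would take the stack, global, and heap ranges from the ABI and linker layout of the target embedded system rather than from the hard-coded constants used here.

```c
#include <stdint.h>

/* Which small cache a data access is steered to: separate stack, global,
 * and heap structures, as in the region-cache organization. */
typedef enum { REGION_STACK, REGION_GLOBAL, REGION_HEAP } region_t;

/* Hypothetical virtual-address boundaries (assumptions for illustration). */
#define GLOBAL_BASE  0x10000000u    /* start of statically allocated data */
#define GLOBAL_TOP   0x20000000u    /* end of statically allocated data */
#define STACK_BASE   0x7f000000u    /* stack region begins here */

/* Steer an access by inspecting the virtual address; only simple range
 * compares are needed, no table lookup. */
static region_t steer_access(uint32_t vaddr)
{
    if (vaddr >= STACK_BASE)
        return REGION_STACK;
    if (vaddr >= GLOBAL_BASE && vaddr < GLOBAL_TOP)
        return REGION_GLOBAL;
    return REGION_HEAP;             /* everything else: dynamically allocated data */
}
```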
Leakage Reduction
- High-VT (static solution)
  - Replace transistors with high-VT ones
  - Static increase in latency
- Gated-VDD / decay caches (state losing)
  - Turn off unused cache lines (loses data)
  - Requires sleep transistors
- Adaptive body biasing (ABB) and drowsy caches (retain state)
  - Significant delay and dynamic power consumption between wakeups for ABB
  - Requires a special manufacturing process for ABB
  - DVS for leakage reduction with drowsy caches
  - Extra circuitry required for both

Previous Drowsy Leakage Policies (sketches of the first two policies follow this slide)
- Simple: put all cache lines to sleep every X cycles
  - Little overhead; power/performance is variable
- No Access: put a cache line to sleep if it is not accessed within X cycles
  - Counters required per cache line
- Reuse Most Recently On (RMRO): the No Access policy applied to cache ways
  - Requires a few bits per cache set and only one counter
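To contrast the two baseline policies, here is a minimal simulation-style sketch (mine, not from the talk): the Simple policy blindly puts every line to sleep once per window, while No Access tracks idleness per line and only puts idle lines to sleep. The array size, window constant, and function names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  1024u            /* assumed number of L1 data-cache lines */
#define WINDOW     4096u            /* decay window in cycles (assumed) */

typedef struct {
    bool     drowsy;                /* true: low-voltage, state-retaining sleep mode */
    uint32_t idle_cycles;           /* cycles since last access (No Access policy) */
} line_state_t;

static line_state_t lines[NUM_LINES];

/* Simple policy: every WINDOW cycles, put all lines to sleep regardless of use. */
static void simple_policy_tick(uint64_t cycle)
{
    if (cycle % WINDOW == 0)
        for (uint32_t i = 0; i < NUM_LINES; i++)
            lines[i].drowsy = true;
}

/* No Access policy: a per-line counter; only lines idle for a full window sleep. */
static void no_access_policy_tick(void)
{
    for (uint32_t i = 0; i < NUM_LINES; i++)
        if (++lines[i].idle_cycles >= WINDOW)
            lines[i].drowsy = true;
}

/* On an access, a drowsy line must first be woken, costing extra latency. */
static bool access_line(uint32_t idx)
{
    bool was_drowsy = lines[idx].drowsy;
    lines[idx].drowsy      = false;
    lines[idx].idle_cycles = 0;
    return was_drowsy;              /* caller charges the wake-up penalty */
}
```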
Reuse Distance (RD) Policy
- Measures time in cache accesses, which increment its counters
- Keeps only the last N accesses "awake" for an RD of size N
- Ensures only N lines are ever awake
- Clock-cycle independent
- Gives an upper bound on the power envelope
(a counter-update sketch follows this slide)

Reuse Distance (RD) LRU
- True LRU is too expensive; substitute with:
  - Quasi-LRU via saturating counters
  - Close approximations via timestamps
[Diagram: per-line LRU counters for RD with N=4; cache accesses check the counter bits to detect drowsy misses and increment the counters of awake lines.]
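The talk does not give code, so here is my own minimal sketch of how the RD policy could be realized with quasi-LRU saturating counters: each awake line counts accesses since it was last touched, and a line whose counter reaches N drops into drowsy mode, so at most N lines are ever awake regardless of clock frequency. The array size and names are assumptions; N = 15 matches the value used in the evaluation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  1024u            /* assumed number of L1 data-cache lines */
#define RD_N       15u              /* reuse-distance size N (value used in the talk) */

typedef struct {
    bool    drowsy;                 /* low-voltage, state-retaining mode */
    uint8_t rd_counter;             /* accesses since this line was last touched */
} rd_line_t;

static rd_line_t rd_lines[NUM_LINES];

/* Called on every cache access.  The touched line is woken and its counter
 * reset; every other awake line ages by one access, and any line whose
 * counter saturates at RD_N goes drowsy.  At most RD_N lines can therefore
 * be awake at once, independent of clock cycles. */
static bool rd_access(uint32_t idx)
{
    bool was_drowsy = rd_lines[idx].drowsy;    /* caller charges wake-up latency */

    for (uint32_t i = 0; i < NUM_LINES; i++) {
        if (i == idx || rd_lines[i].drowsy)
            continue;                          /* only awake, untouched lines age */
        if (++rd_lines[i].rd_counter >= RD_N)
            rd_lines[i].drowsy = true;         /* fell out of the last N accesses */
    }

    rd_lines[idx].drowsy     = false;
    rd_lines[idx].rd_counter = 0;
    return was_drowsy;
}
```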
We Apply
- Region caches with the heap cache size reduced by half; a multiple-access cache to retain performance
- A drowsy cache using the RD policy
- Target: embedded architectures and applications

Experimental Setup
- Alpha 21264 architecture/ISA, HotLeakage simulator
- 1.5 GHz, 70 nm, 80 degrees C
- SPEC2000 benchmarks using SimPoints
- Two-level cache hierarchy
  - 32 KB, 32-byte-line, 4-way L1 D-cache (1 cycle)
  - 4-way unified L2: 256 KB / 512 KB / 1 MB / 2 MB
- Drowsy policies (a configuration sketch follows this slide)
  - Simple policy: 4K-cycle window (No Access omitted)
  - RMRO: 256
  - RD: 15
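For reference, the parameters above can be gathered into a single configuration record; the struct and field names below are my own shorthand, not part of HotLeakage.

```c
#include <stdint.h>

/* Simulation parameters from the slide, collected in one place (illustrative). */
typedef struct {
    uint32_t clock_mhz;             /* core clock */
    uint32_t feature_nm;            /* process technology node */
    uint32_t temperature_c;         /* operating temperature */
    uint32_t l1d_size_kb;           /* L1 D-cache capacity */
    uint32_t l1d_line_bytes;        /* L1 D-cache line size */
    uint32_t l1d_assoc;             /* L1 D-cache associativity */
    uint32_t simple_window_cycles;  /* Simple drowsy policy window */
    uint32_t rmro_window;           /* RMRO window */
    uint32_t rd_n;                  /* RD policy reuse-distance size */
} sim_config_t;

static const sim_config_t cfg = {
    .clock_mhz            = 1500,
    .feature_nm           = 70,
    .temperature_c        = 80,
    .l1d_size_kb          = 32,
    .l1d_line_bytes       = 32,
    .l1d_assoc            = 4,
    .simple_window_cycles = 4096,
    .rmro_window          = 256,
    .rd_n                 = 15,
};
```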
[Diagrams: Column Associative cache and MRU cache organizations.]
Reuse Coverage
[Figure: reuse coverage results.]

Performance
[Chart: IPC normalized to the direct-mapped Simple configuration, comparing the simple and RD policies on the CA and MRU caches.]
Dynamic Energy
[Chart: dynamic power consumption normalized to the simple direct-mapped cache for the simple 2-way associative, simple column-associative, and simple MRU caches.]

Static Energy
[Chart: leakage normalized to a non-drowsy direct-mapped cache for the simple and RD policies, broken down by region (heap, stack, global).]
Total Power Consumption
[Chart: total power normalized to a non-drowsy direct-mapped cache for the simple and RD policies on the DM, CA, and MRU configurations.]

Conclusion
- Cache power reductions
  - Dynamic power reductions achieved via multiple-access caches
  - Significant leakage reduction through the RD policy
  - Minimal performance degradation
- Future work
  - Investigate cache interaction in CMP systems
  - Use compiler hints for static cache assignments
Q&A