Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache
Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang

Novel Non-volatile Memory based Last Level Cache (NVMLLC) Architecture
• Next-generation NVM has the potential for a bigger and more energy-efficient LLC
o Potential ~3x capacity gain over state-of-the-art SRAM, with a logic-compatible process and non-volatility
o Write error rate (WER) target for industry LLC adoption increases write latency in practice
• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

Agenda
• Motivation & problem
• Current solutions
• Our proposals
• Results

NVMs offer capacity advantages over SRAMs for LLC
• NVMs promise high density
• Spin Transfer Torque (STT) RAM, Spin Hall Effect (SHE) MRAM, etc.
• Can build large LLCs
• Significant power/density benefits over SRAM LLC

Advantage of increasing LLC capacity
[Figure: performance normalized to 4MB SRAM LLC. Bar labels: SRAM 4MB 1.00, SRAM 8MB 1.15, SRAM 16MB 1.23.]

But, high write latency negates the capacity gains
[Figure: performance normalized to 4MB SRAM LLC. Bars: SRAM 4MB, STTRAM 8MB WR +0ns, STTRAM 8MB WR +5ns, STTRAM 8MB WR +10ns, STTRAM 8MB WR +20ns. Data labels visible in the extraction: 1.15, 1.13, 1.05, 1.00, 0.85.]

None of the current techniques reduce the write latency enough
• Architectural Techniques
o Dead block predictor for bypassing
o LAP
o Hybrid Cache
• Circuit and Device Techniques
o Increase bit-cell transistor size, trade off latency with retention/higher WER, new devices, etc.

Our Proposal:
1. Reduce Write Interference
2. Eliminate Redundant Writes

1. Reduce Write Interference
[Figure: number of requests arrived (num_writes, num_reads) per interval of 10k cycles, for gcc.200.]
• Many programs exhibit long phases of high read/write activity
• The usual Dead Block Predictor based bypassing is not sufficient
• Need more aggressive write bypassing to reduce write interference

Write Congestion Aware Bypassing (WCAB)
Lookup table (tuned), interval write occupancy (int_write_occ) to bypass score threshold (byp_score_th):
• 1/4th of request queue: 20%
• Half of request queue: 50%
• 3/4th of request queue: 70%
• Equal to request queue: 100%
write_th: 75% of request queue
Request scheduling: if any read is ready, send the read; otherwise send a write.
Bypass decision:
1. If the request queue is full && pending live writes > write_th, continue; otherwise don't bypass.
2. Find the pending write with the lowest live score (min_score).
3. Get the average write occupancy calculated in intervals (int_write_occ).
4. Refer to the lookup table to find the bypass score threshold (byp_score_th) for int_write_occ.
5. If min_score <= byp_score_th, bypass the write with min_score; otherwise don't bypass.
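
The WCAB bypass decision on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The slide does not say how occupancies that fall between table rows are handled, how live scores are scaled, or exactly how "pending live writes" are counted. The sketch assumes that an occupancy maps to the first table row at or above it, that live scores are normalized to [0, 1], and that every pending write counts toward write_th. The names PendingWrite, bypass_score_threshold and select_bypass are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lookup table from the slide: (interval write occupancy as a fraction of the
# request queue, bypass score threshold as a fraction).
LOOKUP_TABLE = [(0.25, 0.20), (0.50, 0.50), (0.75, 0.70), (1.00, 1.00)]
WRITE_TH = 0.75  # write_th: 75% of the request queue


@dataclass
class PendingWrite:
    addr: int
    live_score: float  # assumption: normalized to [0, 1]


def bypass_score_threshold(int_write_occ: float) -> float:
    """Map interval write occupancy to byp_score_th.

    Assumption: use the first table row whose occupancy bound is at or
    above int_write_occ; the slide does not specify the rounding rule.
    """
    for occ_bound, threshold in LOOKUP_TABLE:
        if int_write_occ <= occ_bound:
            return threshold
    return LOOKUP_TABLE[-1][1]


def select_bypass(pending_writes: List[PendingWrite], queue_len: int,
                  queue_capacity: int,
                  int_write_occ: float) -> Optional[PendingWrite]:
    """Return the write to bypass, or None for "don't bypass"."""
    queue_full = queue_len >= queue_capacity
    # Assumption: all pending writes count as "pending live writes".
    if not (queue_full and len(pending_writes) > WRITE_TH * queue_capacity):
        return None
    # Pending write with the lowest live score (min_score).
    victim = min(pending_writes, key=lambda w: w.live_score)
    if victim.live_score <= bypass_score_threshold(int_write_occ):
        return victim
    return None
```

With a full 16-entry queue holding 13 pending writes, a higher interval write occupancy raises the threshold, so a write with a moderate live score is bypassed under congestion but kept when occupancy is low.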

2. Eliminate Redundant Writes
[Figure: percentage of frequent clean and dirty fills in LLC, 0% to 100%. Series: frequent clean fills, frequent dirty fills, one time fills.]
• Significant percentage of frequent clean and dirty fills in LLC
• Dirty fills generate writes in both Exclusive and Inclusive LLC
• Clean fills create writes in Exclusive LLC

2. Virtual Hybrid Cache (VHC)
• Write Merging in L2
o Frequent dirty lines stay in L2 for longer
o Used existing technique to classify frequent dirty lines
o Many writes merge in L2, reducing fills in LLC
• Relaxed Exclusivity (duplicate lines between L2 and LLC)
o Enhancement over LAP for Exclusive Cache
o Retain the duplicate lines near LRU to reduce hit rate loss
o Dirty lines (whenever found) not duplicated in LLC
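
The write-merging idea can be illustrated with a toy model. The sketch below is not the VHC design itself. It assumes a tiny fully associative L2 with LRU replacement, treats every L2 eviction as one dirty fill (write) into the LLC, and takes the set of frequent dirty lines as an input, since the slide only says an existing classifier was used. The function name llc_writes and the trace are hypothetical.

```python
from collections import OrderedDict
from typing import FrozenSet, Iterable


def llc_writes(stores: Iterable[int], l2_capacity: int,
               frequent_dirty: FrozenSet[int] = frozenset()) -> int:
    """Count dirty evictions from a toy LRU L2 into the LLC.

    Lines in frequent_dirty are kept in L2 in preference to other lines,
    so repeated stores to them merge in L2 instead of refilling the LLC.
    """
    l2 = OrderedDict()  # addr -> True, ordered oldest (LRU) first
    writes = 0
    for addr in stores:
        if addr in l2:
            l2.move_to_end(addr)  # store merges into the resident dirty line
            continue
        if len(l2) >= l2_capacity:
            # Evict the oldest line that is not classified frequent dirty;
            # fall back to plain LRU if every resident line is frequent.
            victim = next((a for a in l2 if a not in frequent_dirty),
                          next(iter(l2)))
            del l2[victim]
            writes += 1  # each dirty eviction is one LLC write
        l2[addr] = True
    return writes


# Line 0 is written repeatedly, interleaved with streaming stores.
trace = [0, 1, 2, 0, 3, 4, 0, 5, 6, 0]
plain = llc_writes(trace, l2_capacity=2)
merged = llc_writes(trace, l2_capacity=2, frequent_dirty=frozenset({0}))
```

On this trace the variant that retains line 0 issues fewer LLC writes than plain LRU, which is the effect the slide attributes to write merging in L2.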

Simulation Methodology & Results

Simulation Methodology
• Used a modified version of Multi2Sim simulating 4 x86 cores
• Core parameters similar to Intel Skylake
• SRAM baseline: 4MB, 4 banks, 16 ways, with a round-trip delay of 20 cycles
• STTRAM baseline: 8MB, 8 banks, additional write latency of 20ns
• Workloads:
o Selected 20 workloads from SPEC 2006 and HPCG
o With high L2 MPKI and a range of LLC MPKIs (Table 1 in the paper)
o 20 homogeneous and 44 heterogeneous mixes (by randomly mixing the 20 workloads)
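
The workload construction can be sketched as follows. The slide gives only the counts (20 homogeneous and 44 heterogeneous mixes on 4 cores), so the details here are assumptions: the workload names are placeholders, each heterogeneous mix draws 4 distinct workloads, and a fixed seed is used for reproducibility. The function name make_mixes is hypothetical.

```python
import random
from typing import List


def make_mixes(workloads: List[str], num_mixes: int = 44, cores: int = 4,
               seed: int = 0) -> List[List[str]]:
    """Build random multi-core mixes, one workload per core.

    Assumption: workloads within a mix are distinct (sampling without
    replacement); the slide does not say whether repeats were allowed.
    """
    rng = random.Random(seed)
    return [rng.sample(workloads, cores) for _ in range(num_mixes)]


# Placeholder names standing in for the 20 selected SPEC 2006 / HPCG workloads.
workloads = [f"wl{i:02d}" for i in range(20)]

# Homogeneous: the same workload on all 4 cores.
homogeneous = [[w] * 4 for w in workloads]
# Heterogeneous: 44 random 4-workload mixes.
heterogeneous = make_mixes(workloads)
```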

Performance vs STTRAM LLC Baseline
[Figure: performance normalized to STTRAM 8MB baseline. Series: WCAB, WCAB+VHC.]
• Our proposals provide 26% performance gain over the baseline

Performance vs Similar Area SRAM LLC
[Figure: performance normalized to SRAM 4MB. Configurations: 5 ns, 7 MB; 10 ns, 12 MB; 20 ns, 16 MB; 30 ns, 20 MB. Series: STT - baseline, STT - Proposed Architecture. Data labels visible in the extraction: 1.18, 1.12, 1.12, 1.07, 1.10, 1.03, 0.87, 0.71.]
• Our proposals provide up to 18% performance gain over the SRAM of same area

Performance vs Prior Art
[Figure: performance normalized to 8MB STTRAM baseline. Groups: Homogeneous, Heterogeneous, Geomean. Series: Hybrid LLC - 2MB SRAM, 4MB STTRAM; Hybrid LLC - 1MB SRAM, 6MB STTRAM; STTRAM LLC - LAP; STTRAM LLC - Proposed Architecture. Data labels visible in the extraction: 1.30, 1.26, 1.18, 1.13, 1.11, 1.10, 1.09, 1.09, 1.07, 1.05, 1.04, 1.03.]
• Our proposals perform significantly better than the prior art

Conclusions
• Next-generation NVM has the potential for a bigger and more energy-efficient LLC
• Architectural solutions are required to absorb the high write latency and obtain the capacity benefits
• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC
THANK YOU!