
ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction

ISCA 2018. Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT).


  1. ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction. ISCA 2018. Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT).

  2. 3D-DRAM MITIGATES BANDWIDTH WALL: Modern systems packing many cores → Bandwidth Wall. 3D-Stacked DRAM memory: ✔ 4-8x bandwidth (of traditional memory), ✘ limited capacity. Examples: Hybrid Memory Cube (HMC) from Micron, High Bandwidth Memory (HBM) from Samsung. 3D-DRAM + High-Capacity Memory = Hybrid Memory.

  3. USE 3D-DRAM AS A CACHE: [Figure: memory hierarchy from fast to slow: CPU, L1$, L2$, L3$, DRAM-Cache (3D-DRAM; MCDRAM from Intel), System Memory (NVM / DRAM; OS-visible space).] Using 3D-DRAM as a DRAM cache can improve memory bandwidth (and avoid OS/software changes).

  4. ARCHITECTING LARGE DRAM CACHES: Organize at line granularity (64B) for capacity/BW utilization. A gigascale cache needs a large tag-store (tens of MBs). [Figure: 4GB of data in 3D-DRAM with 128 MB of tags. Tags? Too large for SRAM.]

  5. ARCHITECTING LARGE DRAM CACHES: Organize at line granularity (64B) for high cache utilization. A gigascale cache needs a large tag-store (tens of MBs). [Figure: 4GB of data and 128 MB of tags, both in 3D-DRAM.] Practical designs must store tags in DRAM. How to architect the tag-store for low-latency tag access?
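
The 128 MB tag-store figure can be reproduced with a short calculation. This is a sketch under one assumption that is not on the slide: roughly 2 bytes of tag metadata per 64B line, a value chosen only because it matches the 128 MB shown.

```python
# Back-of-the-envelope tag-store sizing for a line-granularity DRAM cache.
CACHE_BYTES = 4 * 2**30      # 4GB DRAM cache (from the slide)
LINE_BYTES = 64              # 64B lines (from the slide)
TAG_BYTES_PER_LINE = 2       # ASSUMPTION: picked to reproduce the 128 MB figure


def tag_store_bytes(cache_bytes: int, line_bytes: int, tag_bytes: int) -> int:
    """Total tag storage if every cache line carries tag_bytes of metadata."""
    num_lines = cache_bytes // line_bytes
    return num_lines * tag_bytes


num_lines = CACHE_BYTES // LINE_BYTES
tag_mb = tag_store_bytes(CACHE_BYTES, LINE_BYTES, TAG_BYTES_PER_LINE) // 2**20
print(f"{num_lines // 2**20}M lines, {tag_mb} MB of tags")
```

With 64M lines in the cache, even a couple of bytes per line adds up to far more than a typical on-chip SRAM budget, which is the motivation for keeping tags in DRAM.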

  6. EFFICIENT TAG ORGANIZATION (KNL CACHE): Tag-With-Data [Alloy Cache, Intel Knights Landing]. [Figure: a row of adjacent Tag+Data units.] Single Tag+Data lookup (1x hit latency), but direct-mapped. Practical designs use a 64B line size, store Tag-With-Data, and are direct-mapped, to optimize for hit latency. The Intel Knights Landing product (MCDRAM) uses this DRAM-cache organization.
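
A minimal sketch of a direct-mapped Tag-With-Data organization may help here. The class name `TagWithDataCache`, the address split, and the dictionary backing store are illustrative assumptions, not the Alloy Cache or Knights Landing implementation.

```python
LINE_BYTES = 64  # 64B lines (from the slide)


class TagWithDataCache:
    """Direct-mapped cache where each set holds one (tag, data) unit."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.entries = {}  # set index -> (tag, data), stored together

    def split(self, addr: int):
        """Split a byte address into (set index, tag)."""
        line = addr // LINE_BYTES
        return line % self.num_sets, line // self.num_sets

    def install(self, addr: int, data):
        index, tag = self.split(addr)
        self.entries[index] = (tag, data)  # overwrites any conflicting line

    def lookup(self, addr: int):
        """One access fetches tag and data together; None means miss."""
        index, tag = self.split(addr)
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None
```

Because the tag travels with the data, a hit costs a single access, but two lines that map to the same set always evict each other.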

  7. POTENTIAL OF ASSOCIATIVITY: [Figure: Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way caches; annotation: reduce 25% of misses.] How can we make DRAM caches associative? Assumes a 16-core system with a 4GB DRAM cache in front of PCM memory.

  8. ASSOCIATIVITY OPTION 1: SERIAL TAG LOOKUP: [Figure: the address is looked up in Way 0 first; if that misses, Way 1 is looked up.] Serial Tag Lookup enables associativity, but it has serialization delay.

  9. ASSOCIATIVITY OPTION 2: PARALLEL TAG LOOKUP: [Figure: the address is looked up in Way 0 and Way 1 at the same time.] Parallel Lookup avoids serialization latency, but it introduces 2x bandwidth cost.
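
The trade-off between the two options can be sketched with a toy cost model. The function names and the (hit, round trips, lines read) return convention are assumptions made for illustration; real DRAM timing is more involved.

```python
from typing import List, Optional, Tuple


def serial_lookup(set_tags: List[Optional[int]], tag: int) -> Tuple[bool, int, int]:
    """Probe ways one after another.

    Returns (hit, round_trips, lines_read). Each probe is a separate,
    dependent DRAM access, so latency grows with the number of probes.
    """
    for probes, resident in enumerate(set_tags, start=1):
        if resident == tag:
            return True, probes, probes
    return False, len(set_tags), len(set_tags)


def parallel_lookup(set_tags: List[Optional[int]], tag: int) -> Tuple[bool, int, int]:
    """Probe every way at once: one round trip, but all ways are read."""
    return tag in set_tags, 1, len(set_tags)
```

For a 2-way set holding tags `[10, 20]`, a serial lookup of tag 20 takes two dependent accesses, while a parallel lookup of tag 10 finishes in one round trip but still reads both lines.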

  10. ASSOCIATIVITY FOR DRAM CACHE (PARALLEL): [Figure: Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way, annotated "reduce 25% of misses"; panel (b) Speedup (Parallel) for 2-way, 4-way, and 8-way, annotated -46%.] Naively increasing associativity actually degrades performance due to increased BW cost.

  11. ASSOCIATIVITY FOR DRAM CACHE (IDEAL): [Figure: Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way, annotated "reduce 25% of misses"; panel (b) Speedup (Parallel) for 2-way, 4-way, and 8-way, annotated -46%; panel (c) Speedup (Idealized), with latency / BW of direct-mapped, for 2-way, 4-way, and 8-way, annotated 21%.] Associativity must still maintain the latency/BW of direct-mapped caches. How?

  12. OPTION 3: WAY-PREDICTED TAG LOOKUP: [Figure: Way Prediction selects the way to look up first for the address; if that misses, the other way is looked up.] Way-Predicted Tag Lookup can obtain improved hit-rate, with the BW / latency of a direct-mapped cache.
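
As a rough illustration, a way-predicted lookup probes the predicted way first and falls back to the remaining ways only when that probe fails. The function below is a hypothetical sketch, not the design evaluated in the talk.

```python
from typing import List, Optional, Tuple


def way_predicted_lookup(
    set_tags: List[Optional[int]], tag: int, predicted_way: int
) -> Tuple[bool, int]:
    """Probe the predicted way first, then the other ways in order.

    Returns (hit, probes). A correct prediction on a hit costs one probe,
    the same as a direct-mapped cache.
    """
    others = [w for w in range(len(set_tags)) if w != predicted_way]
    for probes, way in enumerate([predicted_way] + others, start=1):
        if set_tags[way] == tag:
            return True, probes
    return False, len(set_tags)  # a miss is confirmed only after every way
```

The benefit depends entirely on prediction accuracy: a wrong prediction on a hit, or any miss, still pays for the extra probes.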

  13. WAY-PREDICTION ACCURACY & COST:

      |                  | MRU Pred (1 bit/set) | Partial-Tag (4 bit/line) |
      |------------------|----------------------|--------------------------|
      | SRAM Storage     | 4MB                  | 32MB                     |
      | Accuracy (2-way) | 85.7%                | 97.3%                    |
      | Accuracy (4-way) | 74.3%                | 91.6%                    |
      | Accuracy (8-way) | 63.2%                | 81.2%                    |

      Prior methods for way-prediction have low accuracy and/or high storage overhead.
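
The storage figures are consistent with a 4GB cache of 64B lines, as the following sketch shows. It assumes the 4MB MRU figure refers to the 2-way configuration, which the slide does not state explicitly.

```python
CACHE_BYTES = 4 * 2**30   # 4GB DRAM cache (from the talk)
LINE_BYTES = 64           # 64B lines (from the talk)
NUM_LINES = CACHE_BYTES // LINE_BYTES


def mru_predictor_bytes(ways: int, bits_per_set: int = 1) -> int:
    """MRU predictor: bits_per_set bits for each cache set."""
    num_sets = NUM_LINES // ways
    return num_sets * bits_per_set // 8


def partial_tag_predictor_bytes(bits_per_line: int = 4) -> int:
    """Partial-tag predictor: bits_per_line bits for each cache line."""
    return NUM_LINES * bits_per_line // 8


# ASSUMPTION: the 4MB figure is for a 2-way cache.
print(mru_predictor_bytes(ways=2) // 2**20, "MB for MRU")
print(partial_tag_predictor_bytes() // 2**20, "MB for partial tags")
```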

  14. TOWARDS ASSOCIATIVITY W/ WAY-PREDICTION: [Figure: way-predicted tag lookup; Way Prediction selects the way to look up first, and the other way is looked up on a miss.] Goal: low storage-overhead, high-accuracy way-prediction, to enable an associative DRAM cache.

  15. ACCORD OVERVIEW • Background • ACCORD – Probabilistic Way-Steering (PWS) – Ganged Way-Steering (GWS) – Skewed Way-Steering (SWS) • Summary

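  16. INSIGHT: WAY-PREDICTABILITY AT LOW STORAGE? [Figure: two 2-way caches holding lines with even and odd tags. Base Install Policy (Rand): even and odd tags end up in either way, so the way is hard to predict (~50%). Tag-based Install Policy: even tags always go to one way and odd tags to the other, so the way can be predicted 100% of the time, but the cache behaves as direct-mapped.] Insight: Modifying the install policy can make way-prediction much simpler!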
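
A small sketch makes the downside of a purely tag-based install policy concrete. The class `TagSteeredSet` and the choice of the tag's low bit as the steering function are assumptions for illustration.

```python
class TagSteeredSet:
    """One 2-way set whose install way is fixed by the tag's low bit."""

    def __init__(self):
        self.ways = [None, None]

    @staticmethod
    def way_for(tag: int) -> int:
        # ASSUMPTION: even tags map to way 0, odd tags to way 1.
        return tag & 1

    def install(self, tag: int):
        """Install tag in its fixed way; return the evicted tag, if any."""
        way = self.way_for(tag)
        evicted = self.ways[way]
        self.ways[way] = tag
        return evicted

    def lookup(self, tag: int) -> bool:
        """A single probe suffices: the way is fully determined by the tag."""
        return self.ways[self.way_for(tag)] == tag
```

Installing tag 2 and then tag 4 evicts tag 2 even though way 1 is still empty. Prediction is perfect, but the set offers no more placement freedom than a direct-mapped cache.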

  17. PROPOSAL: ACCORD: [Figure: the Way Install Policy and the Way Predictor are coordinated for a 2-way cache (Way 0, Way 1) holding lines A2, A3, B3, B5, B7.] ACCORD: AssoCiativity by CoORDinating way-install and prediction. ACCORD achieves a way-predictable cache at low cost.

  18. ACCORD OVERVIEW • Background • ACCORD – Probabilistic Way-Steering (PWS) – Ganged Way-Steering (GWS) – Skewed Way-Steering (SWS) • Summary

  19. PROBABILISTIC WAY-STEERING: [Figure: lines A0-A7 and B0-B7 of pages A and B installed into a 2-way cache using PWS; each line goes to its address-preferred way with bias = 90% and to the other way 10% of the time.] Static prediction of the preferred way is ~90% accurate. Installs still use both ways, which improves hit-rate. PWS enables way-predictability by trading away some of the speed of learning to use both ways (hit-rate).
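
The link between install bias and static prediction accuracy can be sketched with a small simulation. The function names are hypothetical, and deriving the preferred way from the tag's low bit is an assumption; the slide only says the preferred way comes from the address.

```python
import random


def pws_install_way(tag: int, bias: float, rng: random.Random) -> int:
    """Pick the install way for a 2-way cache under Probabilistic Way-Steering."""
    preferred = tag & 1  # ASSUMPTION: preferred way taken from the tag's low bit
    return preferred if rng.random() < bias else 1 - preferred


def static_prediction_accuracy(bias: float, num_lines: int = 100_000, seed: int = 0) -> float:
    """Fraction of installed lines that a 'predict the preferred way' rule finds."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_lines):
        tag = rng.getrandbits(20)
        correct += pws_install_way(tag, bias, rng) == (tag & 1)
    return correct / num_lines
```

In this simplified model the accuracy tracks the bias: a bias of 0.85 gives roughly 85% accuracy, and a bias of 1.0 degenerates to the fully predictable but direct-mapped case.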

  20. SENSITIVITY TO PWS PROBABILITY: Preferred-way Install Probability = x% bias to install in the preferred way. [Figure: Way-Pred Accuracy (%) versus bias for selecting the "preferred way" (50%, 60%, 70%, 80%, 85%, 90%, 100%); the ends of the axis are labeled "2-way design" and "Direct-mapped".]

  21. SENSITIVITY TO PWS PROBABILITY: [Figure: Miss Reduction (%) and Way-Pred Accuracy (%) versus Preferred-way Install Probability (50%, 60%, 70%, 80%, 85%, 90%, 100%); the ends of the axis are labeled "2-way design" and "Direct-mapped".]

  22. SENSITIVITY TO PWS PROBABILITY: [Figure: Speedup (%), Miss Reduction (%), and Way-Pred Accuracy (%) versus Preferred-way Install Probability (50%, 60%, 70%, 80%, 85%, 90%, 100%); speedup data labels: 5.6%, 5.5%, 5.3%, 4.7%, 3.7%, 2.6%, 0.0%.] A preferred-way install probability of 85% provides the best trade-off of hit-rate for way-prediction (WP) accuracy, for a 5.6% speedup.

  23. ACCORD OVERVIEW • Background • ACCORD – Probabilistic Way-Steering (PWS) – Ganged Way-Steering (GWS) – Skewed Way-Steering (SWS) • Summary

  24. GANGED WAY-STEERING: [Figure: lines A0-A7 and B0-B7 installed into a 2-way cache under two policies. Probabilistic Way-Steering: per-line randomized decision, Pred ~50%. Ganged Way-Steering: per-page randomized decision, Pred >90%.] Ganged Way-Steering makes the install decision at a large granularity, to improve predictability for workloads with high spatial locality.

  25. GANGED WAY-STEERING IMPLEMENTATION: [Figure: on install, the Recent Install Table (RIT) maps RegionID to Way and guides the install; on access, the Recent Lookup Table (RLT) maps RegionID to Way and predicts the way. Example entries: RegionID 0x001 → Way 0, RegionID 0x101 → Way 1. Cache contents shown: A2, A3, B3, B5, B7.] GWS: per-region last-way install + last-way prediction. A 64-entry RIT and a 64-entry RLT need only 320 bytes.
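
The RIT/RLT mechanism can be sketched as two small region-to-way tables. Everything beyond the table names and the 64-entry sizes is an assumption: the 4KB region size, the LRU replacement, the fallback policy hook, and the 20 bits per entry (inferred from 320 bytes across 128 entries).

```python
from collections import OrderedDict

REGION_BYTES = 4096   # ASSUMPTION: a region is a 4KB page
BITS_PER_ENTRY = 20   # ASSUMPTION: inferred from 320 B / (64 + 64) entries


class RegionWayTable:
    """Small RegionID -> Way table; LRU replacement is an assumption."""

    def __init__(self, entries: int = 64):
        self.entries = entries
        self.table = OrderedDict()

    def get(self, region: int):
        if region not in self.table:
            return None
        self.table.move_to_end(region)  # mark as most recently used
        return self.table[region]

    def put(self, region: int, way: int):
        self.table[region] = way
        self.table.move_to_end(region)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict least recently used


class GangedWaySteering:
    def __init__(self, fallback_way):
        self.rit = RegionWayTable()  # Recent Install Table
        self.rlt = RegionWayTable()  # Recent Lookup Table
        self.fallback_way = fallback_way  # e.g. a PWS-style per-line choice

    def install_way(self, addr: int) -> int:
        """Install in the way last used for this region, if known."""
        region = addr // REGION_BYTES
        way = self.rit.get(region)
        if way is None:
            way = self.fallback_way(addr)
        self.rit.put(region, way)
        return way

    def predict_way(self, addr: int) -> int:
        """Predict the way where this region's lines were last found."""
        way = self.rlt.get(addr // REGION_BYTES)
        return self.fallback_way(addr) if way is None else way

    def record_hit(self, addr: int, way: int):
        self.rlt.put(addr // REGION_BYTES, way)


print(2 * 64 * BITS_PER_ENTRY // 8, "bytes for RIT + RLT")
```

Once the first line of a region picks a way, later installs from that region follow it, so a single RLT entry is enough to predict the way for the whole region.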

  26. PWS+GWS WAY-PREDICTION ACCURACY: [Figure: Way-Pred Acc (%) for PWS and PWS+GWS, shown for the average of 21 workloads and for Libquantum.] PWS has ~85% base accuracy. GWS enables spatial workloads to have near-100% accuracy. The combination of PWS+GWS achieves 90% accuracy, at the cost of 320B of storage.

  27. PWS+GWS (ACCORD 2-WAY) RESULTS: [Figure: Speedup for PWS, PWS+GWS, and Perfect; PWS+GWS is annotated with 7.3% speedup.] PWS+GWS gets a 7.3% speedup, out of the 10% speedup of a perfectly-predicted 2-way cache. System assumes a 4GB DRAM cache and PCM-based main memory.

  28. ACCORD OVERVIEW • Background • ACCORD – Probabilistic Way-Steering (PWS) – Ganged Way-Steering (GWS) – Skewed Way-Steering (SWS) • Summary

  29. DIFFICULTY IN SCALING TO N-WAYS • Scaling ACCORD to N-ways – ACCORD 4-way has 3% speedup – ACCORD 8-way has 6% slowdown… • Miss confirmation: an N-way cache needs N accesses to confirm a line is not resident. [Figure: an access to address E checks Way 0, Way 1, Way 2, and Way 3 (holding A, B, C, D) before the miss is confirmed.] We need solutions to reduce miss-confirmation cost.

  30. SOLUTION: SKEWED WAY-STEERING: 4-way with 2-skew: each line has one preferred + one alternate way. [Figure: accesses to A, B, C are placed across Way 0 to Way 3; an access to E needs only 2 lookups to determine a miss.] Restricting placement reduces miss-confirmation cost → hit-rate benefits without any storage overhead.
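
A sketch of the 2-skew idea: each tag may live in only two of the N ways, so a miss is confirmed after two probes rather than N. The hash used to pick the preferred and alternate ways below is an arbitrary assumption; the talk does not specify the mapping.

```python
from typing import List, Optional, Tuple


def candidate_ways(tag: int, num_ways: int = 4) -> Tuple[int, int]:
    """Return (preferred, alternate) ways for a tag.

    ASSUMPTION: simple modular hashing, chosen only so that the two
    candidate ways are always distinct.
    """
    preferred = tag % num_ways
    offset = 1 + (tag // num_ways) % (num_ways - 1)  # in 1 .. num_ways - 1
    return preferred, (preferred + offset) % num_ways


def sws_lookup(set_tags: List[Optional[int]], tag: int) -> Tuple[bool, int]:
    """Probe the preferred way, then the alternate; return (hit, probes)."""
    for probes, way in enumerate(candidate_ways(tag, len(set_tags)), start=1):
        if set_tags[way] == tag:
            return True, probes
    return False, 2  # miss confirmed after two probes, whatever num_ways is
```

Different tags get different way pairs, so the set as a whole still uses all of its ways while each individual lookup touches at most two.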

  31. SPEEDUP FROM ACCORD (WITH SWS): [Figure: Speedup for 2-way, 4-way, and 8-way.] SWS 8-way achieves 11% speedup.

  32. ACCORD OVERVIEW • Background • ACCORD – Probabilistic Way-Steering (PWS) – Ganged Way-Steering (GWS) – Skewed Way-Steering (SWS) • Summary

  33. SUMMARY OF ACCORD • ACCORD: associative DRAM caches by coordinating way-install and way-prediction. • Probabilistic Way-Steering – Biased install enables accurate static way-prediction • Ganged Way-Steering – Region-based install enables accurate region-based way-prediction • Skewed Way-Steering – Skew enables flexibility in line placement, while maintaining miss cost • ACCORD enables associativity at negligible storage cost (320B), to achieve 11% speedup.

  34. ACCORD BACKUP SLIDES
