
Instruction Cache and Branch Target Buffer - PowerPoint PPT Presentation

Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer. Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, Daniel A. Jiménez


  1. Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer. Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, Daniel A. Jiménez

  5. [Chart: 21%, 36%, 24%; policies shown: LRU, Rand, SRRIP]

  8. Thousands of workloads from popular benchmark suites (Suite A, Suite B, Suite C). Source: Maximilien Breughe, presentation at ISCA 2016. Part of the fifth Championship Branch Prediction (CBP5), provided by Samsung: 223 training + 439 evaluation workloads

  9. Many applications have significant I-cache and BTB misses

  10. [Figure: top-down breakdown of pipeline stalls into retire, wrong speculation, front-end bound, and back-end bound. The front-end fetch-latency branch includes ITLB miss, I-cache miss, and branch resteer; the back-end memory-bound branch includes L1, L2, and L3 misses]

  17. No previous work on Predictive Replacement Policies for I-cache and BTB

  18. If A becomes dead, B and C are likely to become dead too. [Diagram: blocks A, B, and C are all tagged with the same PC]

  19. Sampling Dead Block Prediction learns from a small number of sets. [Diagram: accesses to the sampled sets, identified by PC, update the prediction table]

  20. Sampling Dead Block Prediction removes many dead blocks from the last-level (LL) cache. Photo credit: Sampling Dead Block Predictor by Khan et al.
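
The idea on slides 18 to 20 can be sketched as a PC-indexed table of saturating counters that is trained only by a few sampled sets. This is a simplified illustration, not the predictor of Khan et al.: the table size, threshold, sampling ratio, and all function names are assumptions, and the real design keeps a separate sampler tag array that is omitted here.

```python
# Simplified sketch of PC-indexed dead-block prediction with set sampling.
# Every constant and name below is an illustrative assumption.

TABLE_SIZE = 1024      # assumed number of predictor counters
COUNTER_MAX = 3        # 2-bit saturating counters
DEAD_THRESHOLD = 2     # assumed: counter >= threshold means "predict dead"
SAMPLE_EVERY = 16      # assumed: one of every 16 sets trains the predictor

table = [0] * TABLE_SIZE


def is_sampled(set_index: int) -> bool:
    """Only a small fraction of the sets is allowed to train the table."""
    return set_index % SAMPLE_EVERY == 0


def index_of(pc: int) -> int:
    """Map the PC of the last access to a table entry."""
    return pc % TABLE_SIZE


def train(set_index: int, last_pc: int, was_reused: bool) -> None:
    """Record whether a block last touched by last_pc was reused or evicted."""
    if not is_sampled(set_index):
        return
    i = index_of(last_pc)
    if was_reused:
        table[i] = max(0, table[i] - 1)            # evidence the block was live
    else:
        table[i] = min(COUNTER_MAX, table[i] + 1)  # evidence the block was dead


def predict_dead(pc: int) -> bool:
    """Blocks last touched by this PC are predicted dead above the threshold."""
    return table[index_of(pc)] >= DEAD_THRESHOLD
```

Because the table is indexed by PC, training on block A also changes the prediction for blocks B and C when the same PC touched them last.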

  21. [Chart: MPKI per benchmark, LRU vs. SDBP] SDBP increases I-cache MPKI by 4% on average

  22. In the I-cache or BTB, one PC accesses only one set. [Diagram: set accesses by PC in the D-cache vs. the I-cache/BTB]
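
A small sketch of why this holds: an instruction's PC fixes its own I-cache set, while a single load PC can touch many D-cache sets. The 64KB, 8-way, 64B-block geometry is the I-cache configuration from the evaluation slides; using the same geometry for the D-cache, and the addresses themselves, are assumptions for illustration.

```python
BLOCK_BYTES = 64                    # 64B blocks
NUM_SETS = 64 * 1024 // 64 // 8     # 64KB / 64B blocks / 8 ways = 128 sets


def icache_set(pc: int) -> int:
    """The set is a function of the instruction's own address."""
    return (pc // BLOCK_BYTES) % NUM_SETS


def dcache_set(data_address: int) -> int:
    """The set depends on the data address, not on the PC of the load."""
    return (data_address // BLOCK_BYTES) % NUM_SETS


# One load instruction (hypothetical PC) walking an array, one block per step.
load_pc = 0x401000
data_addresses = [0x10000 + BLOCK_BYTES * i for i in range(100)]

sets_touched_in_icache = {icache_set(load_pc) for _ in data_addresses}
sets_touched_in_dcache = {dcache_set(a) for a in data_addresses}

print(len(sets_touched_in_icache), len(sets_touched_in_dcache))
```

A PC-indexed predictor trained on a few sampled sets therefore never sees most PCs in an I-cache or BTB, which is consistent with SDBP not helping on the previous slide.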

  23. GHRP

  24. GHRP correlates Reuse behavior with control flow History. [Diagram: the global history (PC t-4, PC t-3, PC t-2, PC t-1, each followed by a 0 bit) is XORed with PC t to form the signature]

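
The signature computation in the diagram can be sketched as a single XOR. The 16-bit width matches the signature size given on later slides; the function name and the example values are hypothetical.

```python
SIGNATURE_BITS = 16
SIGNATURE_MASK = (1 << SIGNATURE_BITS) - 1


def make_signature(global_history: int, pc: int) -> int:
    """XOR the global history with the current PC and keep 16 bits."""
    return (global_history ^ pc) & SIGNATURE_MASK


# The same PC reached along two different paths gets two different signatures.
same_pc = 0x4ABC
print(hex(make_signature(0x1111, same_pc)), hex(make_signature(0x2222, same_pc)))
```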

  26. Extra information kept in I-cache block: valid (1 bit), LRU stack (3 bits), prediction (1 bit), signature (16 bits)

  27. GHRP prediction is done by tracking behavior using the signature: reuse and eviction move the signature's counters in opposite directions


  29. Voting is required for GHRP decisions. [Diagram: the signature is hashed three ways (Hash1, Hash2, Hash3); each selected counter is compared against a threshold, and a majority vote gives the prediction]
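
A minimal sketch of the voting step, assuming three tables of 2-bit counters with 4,096 entries each (the sizes given on the storage slide). The hash functions, the threshold value, and the choice to count up on eviction and down on reuse are assumptions made for illustration; the slides do not specify them.

```python
TABLE_ENTRIES = 4096
NUM_TABLES = 3
COUNTER_MAX = 3      # 2-bit saturating counters
THRESHOLD = 2        # assumed: a counter at or above this votes "dead"


def table_indices(signature: int) -> list:
    """Three hypothetical hashes of the signature, one per table."""
    return [
        signature % TABLE_ENTRIES,
        (signature ^ (signature >> 4)) % TABLE_ENTRIES,
        ((signature * 0x9E37) >> 3) % TABLE_ENTRIES,
    ]


class VotingPredictor:
    def __init__(self) -> None:
        self.tables = [[0] * TABLE_ENTRIES for _ in range(NUM_TABLES)]

    def predict_dead(self, signature: int) -> bool:
        """Each table votes by comparing its counter with the threshold."""
        votes = sum(
            table[i] >= THRESHOLD
            for table, i in zip(self.tables, table_indices(signature))
        )
        return votes >= 2  # majority of three

    def train(self, signature: int, was_evicted_unused: bool) -> None:
        """Assumed direction: up on eviction without reuse, down on reuse."""
        for table, i in zip(self.tables, table_indices(signature)):
            if was_evicted_unused:
                table[i] = min(COUNTER_MAX, table[i] + 1)
            else:
                table[i] = max(0, table[i] - 1)
```

Voting across differently hashed tables limits the damage when two signatures collide in one table, since the other two tables can still outvote it.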

  32. [Diagram: on an access, a new signature is computed and a new prediction is made; the new prediction decides whether to bypass]

  34. [Diagram: on a miss that is not bypassed, a victim block is chosen and replaced by the new block]
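
One way to read the miss path on these slides is sketched below. The slides show only the labels (miss, bypass or not bypass, victim block, new block), so the preference order used here (invalid way, then a predicted-dead block, then LRU) is an assumption, as are the `Block` fields and function names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    valid: bool
    predicted_dead: bool
    lru_position: int  # 0 = most recently used; larger = older


def choose_victim(blocks: List[Block]) -> int:
    """Assumed order: an invalid way, then a predicted-dead block, then LRU."""
    for way, block in enumerate(blocks):
        if not block.valid:
            return way
    for way, block in enumerate(blocks):
        if block.predicted_dead:
            return way
    return max(range(len(blocks)), key=lambda way: blocks[way].lru_position)


def handle_miss(blocks: List[Block], incoming_predicted_dead: bool) -> Optional[int]:
    """Return the way to fill, or None when the incoming block is bypassed."""
    if incoming_predicted_dead:
        return None
    return choose_victim(blocks)
```

With this ordering the policy falls back to plain LRU whenever the predictor marks no block in the set as dead.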

  37. [Diagram: handling of a hit, applied to the hit block]

  40. [Diagram: global history update. The old history (PC t-4, PC t-3, PC t-2, PC t-1, each followed by a 0 bit) is shifted left, dropping PC t-4, and PC t followed by a 0 bit is appended to form the new global history]
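
The shift-and-append update in this diagram can be sketched as follows. The 16-bit history register holds four PC slots; the split of each 4-bit slot into three PC bits plus one 0 bit is inferred from the diagram, and the choice of the three lowest PC bits is an assumption.

```python
HISTORY_BITS = 16
HISTORY_MASK = (1 << HISTORY_BITS) - 1
SLOT_BITS = 4  # inferred: three PC bits followed by one 0 bit


def update_history(history: int, pc: int) -> int:
    """Shift out the oldest slot and append the slot for the current PC."""
    slot = (pc & 0b111) << 1  # assumed: low three PC bits, then a 0 bit
    return ((history << SLOT_BITS) | slot) & HISTORY_MASK


history = 0
for pc in [1, 2, 3, 4, 5]:
    history = update_history(history, pc)

# After five updates the slot for the first PC has been shifted out.
print(hex(history))  # 0x468a
```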

  41. If A becomes dead in the I-cache, B is likely to become dead in the BTB too. [Diagram: the same branch maps to block A in the I-cache and entry B in the BTB]

  42. BTB and I-cache can share prediction resources. [Diagram: BTB entries and I-cache blocks each hold valid, prediction, LRU stack, and signature fields]

  44. BTB and I-cache joint design. [Diagram: global history built from branches br t-4, br t-3, br t-2, br t-1, each followed by a 0 bit]

  45. Methodology
      Simulator: CBP5, trace driven
      Branch predictor: hashed perceptron
      Workloads: 662 traces (CBP5, Samsung): Short-Mobile, Long-Mobile, Short-Server, Long-Server
      Metric: MPKI
      Comparison: LRU (baseline), Random, SRRIP, SDBP
      I-cache: 64KB, 8-way, 64B blocks
      BTB: 4K entries, 8-way

  50. Storage overhead for a 64KB, 8-way I-cache with 64B blocks
      Prediction tables: three tables x 4,096 entries x 2-bit counters = 3KB
      Signature bits: 1,024 blocks x 16 bits = 2KB
      Prediction bits: 1,024 blocks x 1 bit = 128B
      History register: one register x 16 bits = 2B
      Total storage overhead of GHRP is 5.13KB, or 8% of the capacity of the I-cache
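
The totals on this slide can be checked with a few lines of arithmetic. All sizes are taken from the slide; the only assumption is that 1KB means 1,024 bytes.

```python
BITS_PER_BYTE = 8
KB = 1024  # assumed: binary kilobytes

icache_bytes = 64 * KB
blocks = icache_bytes // 64                  # 64B blocks -> 1,024 blocks

prediction_tables_bits = 3 * 4096 * 2        # three tables of 2-bit counters
signature_bits = blocks * 16                 # 16-bit signature per block
prediction_bits = blocks * 1                 # 1 prediction bit per block
history_register_bits = 16                   # one 16-bit register

total_bytes = (
    prediction_tables_bits
    + signature_bits
    + prediction_bits
    + history_register_bits
) // BITS_PER_BYTE

total_kb = total_bytes / KB
overhead_fraction = total_bytes / icache_bytes

print(total_bytes, round(total_kb, 2), round(overhead_fraction * 100))
# 5250 5.13 8
```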

  51. I-cache MPKI reduction relative to LRU: with 95% certainty, GHRP reduces I-cache MPKI by 33% compared to LRU

  52. BTB MPKI reduction relative to LRU: with 95% certainty, GHRP reduces BTB MPKI by 41% compared to LRU
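
For reference, MPKI (misses per thousand instructions) and a reduction relative to a baseline can be computed as below. The miss and instruction counts are made-up numbers for a single hypothetical trace; the 33% and 41% figures on the slides are aggregate results across the traces and are not reproduced by this example.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand instructions."""
    return misses * 1000 / instructions


def relative_reduction(baseline_mpki: float, new_mpki: float) -> float:
    """Fraction by which new_mpki is lower than baseline_mpki."""
    return (baseline_mpki - new_mpki) / baseline_mpki


# Hypothetical counts for one trace of one million instructions.
lru_mpki = mpki(6000, 1_000_000)    # 6.0
other_mpki = mpki(4000, 1_000_000)  # 4.0
print(f"{relative_reduction(lru_mpki, other_mpki):.1%}")
```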

  53. [Chart: 21%, 36%, 24%, 28%, 48%; policies shown: LRU, Rand, SRRIP, SDBP, GHRP]

  54. Questions?
