Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer
Samira Mirbagher Ajorpaz, Elba Garza, Sangam Jindal, Daniel A. Jiménez
[Chart: LRU 21%, Rand 36%, SRRIP 24%]
Thousands of workloads from popular benchmark suites (Suite A, Suite B, Suite C). Source: Maximilien Breughe, presentation at ISCA 2016.
Part of the fifth Championship Branch Prediction (CBP5), provided by Samsung: 223 training + 439 evaluation workloads.
Many applications have significant I-cache and BTB misses
[Figure: top-down pipeline breakdown (retiring, wrong speculation, front-end bound, back-end bound). Front-end fetch latency includes ITLB misses, I-cache misses, and branch resteers.]
No previous work on Predictive Replacement Policies for I-cache and BTB
If A becomes dead, B and C are likely to become dead too.
[Diagram: blocks A, B, and C, each tagged with the same PC]
Sampling Dead Block Prediction learns from a small number of sets.
[Diagram: PCs that access the sampled sets update the prediction table]
Sampling Dead Block Prediction eliminates many dead blocks in the last-level cache.
Photo credit: Sampling Dead Block Predictor by Khan et al.
[Figure: I-cache MPKI per benchmark, LRU vs. SDBP]
SDBP increases I-cache MPKI by 4% on average.
In the I-cache or BTB, one PC accesses only one set.
[Diagram: PC-to-set mapping in the D-cache vs. the I-cache/BTB]
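Why one PC touches only one I-cache set can be seen from the set-index arithmetic. The sketch below assumes conventional indexing (block offset bits, then set index bits) and uses the 64KB, 8-way, 64B-block geometry from the evaluation setup; the function name `icache_set` is hypothetical.

```python
# Geometry taken from the slides: 64KB, 8-way I-cache with 64B blocks.
CACHE_BYTES = 64 * 1024
BLOCK_BYTES = 64
WAYS = 8
NUM_SETS = CACHE_BYTES // (BLOCK_BYTES * WAYS)  # 128 sets

def icache_set(pc: int) -> int:
    """Set index for an instruction address (assumed conventional indexing)."""
    return (pc // BLOCK_BYTES) % NUM_SETS

# The fetch address is the PC itself, so a given PC always lands in one set.
# A load instruction, by contrast, can touch data addresses in many sets.
print(icache_set(0x401000), icache_set(0x40103F), icache_set(0x401040))
```

Because the PC fully determines the set, a sampler that watches only a few sets would never observe most PCs, which is consistent with the SDBP result above.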
GHRP: Global History Reuse Prediction
GHRP correlates reuse behavior with control-flow history.
[Diagram: the global history, holding PC(t-4), PC(t-3), PC(t-2), and PC(t-1) with zero padding, is XORed with PC(t) to form the signature]
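A minimal sketch of the signature computation in the diagram, assuming the signature is the XOR of the 16-bit global history with the low 16 bits of the current PC. The 16-bit width comes from the slides; which PC bits are used and the name `make_signature` are assumptions.

```python
SIGNATURE_BITS = 16                      # signature width given in the slides
SIGNATURE_MASK = (1 << SIGNATURE_BITS) - 1

def make_signature(global_history: int, pc: int) -> int:
    """XOR the global history with the current PC, truncated to 16 bits.

    Using the low-order PC bits is an assumption of this sketch.
    """
    return (global_history ^ pc) & SIGNATURE_MASK

# The same PC reached through different control-flow paths gets
# different signatures, so its reuse behavior is tracked per path.
print(hex(make_signature(0x1234, 0x00FF)))
print(hex(make_signature(0x4321, 0x00FF)))
```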
Extra information kept in each I-cache block:
valid: 1 bit
prediction: 1 bit
LRU stack: 3 bits
signature: 16 bits
GHRP makes predictions by tracking behavior under each signature: reuse and eviction move the prediction counters in opposite directions.
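The training rule can be sketched with a 2-bit saturating counter (the slides list 2-bit counters). The direction is an assumption here: this sketch counts evidence of deadness, so eviction without reuse increments and reuse decrements. The name `train` is hypothetical.

```python
COUNTER_MAX = 3  # 2-bit saturating counter, values 0..3

def train(counter: int, reused: bool) -> int:
    """Return the updated counter for one signature.

    Assumption: higher values mean "more likely dead", so a reuse
    decrements and an eviction increments, saturating at both ends.
    """
    if reused:
        return max(0, counter - 1)
    return min(COUNTER_MAX, counter + 1)

c = 1
c = train(c, reused=False)  # block evicted without reuse
c = train(c, reused=False)  # saturates at 3
c = train(c, reused=True)   # a hit pulls the counter back down
print(c)
```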
Voting is required for GHRP decisions.
[Diagram: the counters selected by Hash1, Hash2, and Hash3 are each compared against a threshold, and a majority vote of the three results gives the prediction]
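The voting step can be sketched as follows. The three tables of 4,096 counters match the storage slide; the hash functions, the threshold value, and whether the comparison is strict are all assumptions, and the names `table_indices` and `predict_dead` are hypothetical.

```python
TABLE_ENTRIES = 4096   # entries per prediction table, from the slides
DEAD_THRESHOLD = 2     # assumed threshold for a 2-bit counter

def table_indices(signature: int) -> list[int]:
    """Three hypothetical hashes, one index per prediction table."""
    return [
        signature % TABLE_ENTRIES,
        (signature ^ (signature >> 4)) % TABLE_ENTRIES,
        (signature * 31 + 7) % TABLE_ENTRIES,
    ]

def predict_dead(tables: list[list[int]], signature: int) -> bool:
    """Majority vote: dead if at least two counters reach the threshold."""
    votes = sum(
        1
        for table, idx in zip(tables, table_indices(signature))
        if table[idx] >= DEAD_THRESHOLD
    )
    return votes >= 2

tables = [[0] * TABLE_ENTRIES for _ in range(3)]
sig = 0x1234
idx = table_indices(sig)
tables[0][idx[0]] = 3
tables[2][idx[2]] = 3
print(predict_dead(tables, sig))  # two of three tables vote "dead"
```

Using several differently hashed tables means that a signature aliasing with another in one table is unlikely to alias in all three, so a single polluted counter cannot flip the decision.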
[Diagram: miss path. The new signature produces a new prediction; the incoming block is either bypassed or, if not bypassed, the new block replaces a victim block.]
[Diagram: hit path, operating on the hit block.]
[Diagram: global history update. The history holding PC(t-4), PC(t-3), PC(t-2), PC(t-1) is shifted left, which drops PC(t-4), and PC(t) is inserted to form the new global history: PC(t-3), PC(t-2), PC(t-1), PC(t).]
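The history update in the diagram can be sketched as a shift register. The 16-bit history holding four PCs comes from the slides; the slot layout (three low-order PC bits followed by one zero bit) is an assumption read off the zero padding in the diagram, and `update_history` is a hypothetical name.

```python
HISTORY_BITS = 16   # history register width, from the slides
SLOT_BITS = 4       # assumed: four slots of 4 bits each
PC_BITS = 3         # assumed: 3 PC bits per slot plus one zero pad bit

def update_history(history: int, pc: int) -> int:
    """Shift the oldest PC out and insert the newest PC's slot."""
    slot = (pc & ((1 << PC_BITS) - 1)) << 1   # PC bits followed by a 0
    return ((history << SLOT_BITS) | slot) & ((1 << HISTORY_BITS) - 1)

h = 0
for pc in (0b101, 0b011, 0b111, 0b001):
    h = update_history(h, pc)
print(hex(h))                          # four PCs recorded
print(hex(update_history(h, 0b100)))   # oldest PC has been shifted out
```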
If A becomes dead in the I-cache, B is likely to become dead in the BTB too.
[Diagram: the same branch maps to block A in the I-cache and entry B in the BTB]
BTB and I-cache can share prediction resources.
[Diagram: each BTB entry and each I-cache block keeps the same fields: valid, prediction, LRU stack, signature]
BTB and I-cache joint design
[Diagram: global history built from branches br(t-4), br(t-3), br(t-2), br(t-1), with zero padding]
Experimental setup
Simulator: CBP5, trace driven
Branch predictor: Hashed Perceptron
Workloads: 662 traces from CBP5, provided by Samsung (Short-Mobile, Long-Mobile, Short-Server, Long-Server)
Metric: MPKI
Comparison: LRU (baseline), Random, SRRIP, SDBP
I-cache: 64KB, 8-way, 64B blocks
BTB: 4K entries, 8-way
Storage overhead for a 64KB, 8-way I-cache with 64B blocks:
Prediction tables: three tables x 4,096 entries x 2-bit counters = 3KB
Signature bits: 16 bits x 1,024 blocks = 2KB
Prediction bits: 1 bit x 1,024 blocks = 128B
History register: one 16-bit register = 2B
Total storage overhead of GHRP is 5.13KB, or 8% of the capacity of the I-cache.
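The storage figures can be checked with a few lines of arithmetic. All sizes below are taken from the slide; only the variable names are introduced here.

```python
BITS_PER_BYTE = 8

# Sizes as listed on the storage slide.
table_bytes = 3 * 4096 * 2 // BITS_PER_BYTE     # three tables of 2-bit counters
signature_bytes = 16 * 1024 // BITS_PER_BYTE    # 16-bit signature per block
prediction_bytes = 1 * 1024 // BITS_PER_BYTE    # 1 prediction bit per block
history_bytes = 16 // BITS_PER_BYTE             # one 16-bit history register

total_bytes = table_bytes + signature_bytes + prediction_bytes + history_bytes
icache_bytes = 64 * 1024

print(f"total = {total_bytes / 1024:.2f}KB")
print(f"overhead = {100 * total_bytes / icache_bytes:.1f}% of the I-cache")
```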
[Figure: I-cache MPKI reduction relative to LRU]
With 95% certainty, GHRP reduces I-cache MPKI by 33% compared to LRU.
[Figure: BTB MPKI reduction relative to LRU]
With 95% certainty, GHRP reduces BTB MPKI by 41% compared to LRU.
[Chart: LRU 21%, Rand 36%, SRRIP 24%, SDBP 28%, GHRP 48%]
Questions?