Lecture 10: Branch Prediction and Instruction Delivery

Branch target buffer, return address prediction, tournament predictor, high-performance instruction delivery

Correlating Branch Predictor

General form: (m, n) predictor
  - m bits for global history, n bits for local history
  - Records correlation between m+1 branches
  - Simple implementation: global history can be stored in a shift register
Example: (2,2) predictor, 2-bit global, 2-bit local
[Figure: a 4-bit branch address and the 2-bit global branch history (01 = not taken then taken) select one of the 2-bit per-branch local predictors, which supplies the prediction.]

Branch Target Buffer

Branch Target Buffer (BTB): address of branch is the index to get prediction AND branch address (if taken)
  - Note: must check for branch match now, since can’t use wrong branch address
Example: BTB combined with BHT
[Figure: the PC of the instruction in FETCH is compared (=?) with the stored Branch PC; each entry also holds a Predicted PC and extra prediction state bits. Yes: instruction is a branch, use predicted PC as next PC. No: branch not predicted, proceed normally (Next PC = PC+4).]

Accuracy of Different Schemes (Figure 3.15, p. 206)

[Figure: bar chart of frequency of mispredictions (axis 0% to 20%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li under three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]

Estimate Branch Penalty

EX: BHT correct rate is 95%, BTB hit rate is 95%
Average miss penalty is 6 cycles
How much is the branch penalty?

Return Addresses Prediction

Register indirect branch: hard to predict address
  - Many callers, one callee
  - Jump to multiple return addresses from a single address (no PC-target correlation)
SPEC89: 85% of such branches are for procedure return
Since stack discipline for procedures, save return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate
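The return-address buffer described on the Return Addresses Prediction slide can be sketched as a small circular stack. This is only an illustrative model: the class name `ReturnAddressStack`, the method names, and the overwrite-oldest policy on overflow are assumptions, not details given in the lecture.

```python
class ReturnAddressStack:
    """Hypothetical model of a small return-address predictor (8 to 16 entries)."""

    def __init__(self, entries=8):
        self.entries = entries
        self.slots = [None] * entries
        self.top = 0     # index of the next free slot, wraps around
        self.depth = 0   # number of valid entries, capped at `entries`

    def on_call(self, return_pc):
        # Push the address of the instruction after the call.
        # Assumption: when full, the oldest entry is silently overwritten.
        self.slots[self.top] = return_pc
        self.top = (self.top + 1) % self.entries
        self.depth = min(self.depth + 1, self.entries)

    def on_return(self):
        # Pop the predicted return target; None means "no prediction".
        if self.depth == 0:
            return None
        self.top = (self.top - 1) % self.entries
        self.depth -= 1
        return self.slots[self.top]


# Nested calls return in last-in, first-out order.
ras = ReturnAddressStack(entries=8)
ras.on_call(0x1004)          # caller A calls f
ras.on_call(0x2008)          # f calls g
first = ras.on_return()      # g returns into f
second = ras.on_return()     # f returns into A
```

Because calls and returns nest, the predicted target is simply the most recent unmatched call, which is why a few entries are enough unless the call depth exceeds the buffer size.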

Accuracy of Return Address Predictor

[Figure: chart not recoverable from the extracted text.]

Tournament Branch Predictor

Used in Alpha 21264: track both “local” and global history
Intended for mixed types of applications
Global history: T/NT history of past k branches, e.g. 0 1 0 1 0 1 (NT T NT T NT T)
[Figure: the PC feeds a local predictor; the global history feeds a global predictor and a choice predictor; the choice predictor drives a mux that selects the final NT/T prediction.]

Tournament Branch Predictor

Local predictor: use 10-bit local history, shared 3-bit counters
  - PC -> local history table (1Kx10) -> 10 bits -> counters (1Kx3) -> 1 bit -> NT/T
Global and choice predictors:
  - Global history, 12-bit (e.g. 010101010101) -> 12 bits -> counters (4Kx2) -> 1 bit -> NT/T
  - Global history, 12-bit -> 12 bits -> counters (4Kx2) -> 1 bit -> local/global

Branch Prediction With n-way Issue

1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

Integrated Instruction Fetch Units

1. Integrated branch prediction: branch predictor becomes part of the instruction fetch unit
2. Instruction prefetch: fetch ahead to deliver multiple instructions per cycle
3. Instruction memory access and buffering: may access multiple cache lines in one cycle, use prefetch to hide the cost
Another approach: trace cache

Instruction Fetch Unit

Fetch predictor: predicts next fetch addresses to avoid fetch delay; may pre-predict branch direction; may be integrated with I-cache
Branch predictor overrides and trains fetch predictor
[Figure: pipeline with Fetch (I-cache and fetch predictor), Decode/REN (branch predictor), out-of-order execution engine, in-order commit.]
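The table sizes on the tournament predictor slides (1Kx10 local history, 1Kx3 local counters, 4Kx2 global counters, 4Kx2 choice counters) can be wired together roughly as follows. This is a minimal sketch, not the Alpha 21264 implementation: the PC indexing, the initial counter values, and the rule of training the choice counter only when the two predictors disagree are all assumptions made for illustration.

```python
def saturate(counter, taken, max_value):
    """Move a saturating counter up on taken, down on not taken."""
    if taken:
        return min(max_value, counter + 1)
    return max(0, counter - 1)


class TournamentPredictor:
    """Hypothetical model using the table sizes given on the slides."""

    def __init__(self):
        self.local_hist = [0] * 1024   # 1K x 10-bit per-branch histories
        self.local_ctr = [3] * 1024    # 1K x 3-bit counters, taken if >= 4
        self.global_hist = 0           # 12-bit shift register
        self.global_ctr = [1] * 4096   # 4K x 2-bit counters, taken if >= 2
        self.choice_ctr = [1] * 4096   # 4K x 2-bit counters, use global if >= 2

    def _index(self, pc):
        # Assumption: word-aligned PCs, low 10 bits select the history entry.
        return (pc >> 2) & 0x3FF

    def _components(self, pc):
        lh = self.local_hist[self._index(pc)]
        local_pred = self.local_ctr[lh] >= 4
        global_pred = self.global_ctr[self.global_hist] >= 2
        return lh, local_pred, global_pred

    def predict(self, pc):
        _, local_pred, global_pred = self._components(pc)
        use_global = self.choice_ctr[self.global_hist] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        lh, local_pred, global_pred = self._components(pc)
        gh = self.global_hist
        # Assumed policy: train the chooser only when the predictors disagree.
        if local_pred != global_pred:
            self.choice_ctr[gh] = saturate(self.choice_ctr[gh], global_pred == taken, 3)
        self.local_ctr[lh] = saturate(self.local_ctr[lh], taken, 7)
        self.global_ctr[gh] = saturate(self.global_ctr[gh], taken, 3)
        # Shift the outcome into both histories.
        bit = 1 if taken else 0
        self.local_hist[self._index(pc)] = ((lh << 1) | bit) & 0x3FF
        self.global_hist = ((gh << 1) | bit) & 0xFFF


# Usage: one branch that alternates taken / not taken.
predictor = TournamentPredictor()
wrong_after_warmup = 0
for k in range(200):
    taken = (k % 2 == 0)
    if k >= 100 and predictor.predict(0x400) != taken:
        wrong_after_warmup += 1
    predictor.update(0x400, taken)
```

Once both histories have filled with the alternating pattern, each history value always leads to the same outcome, so both component predictors settle and the choice between them no longer matters for this branch.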

Pitfall: Sometimes bigger and dumber is better

21264 uses tournament predictor (29 Kbits)
Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
SPEC95 benchmarks, 21264 outperforms
  - 21264 avg. 11.5 mispredictions per 1000 instructions
  - 21164 avg. 16.5 mispredictions per 1000 instructions
Reversed for transaction processing (TP)!
  - 21264 avg. 17 mispredictions per 1000 instructions
  - 21164 avg. 15 mispredictions per 1000 instructions
TP code much larger & 21164 hold 2X branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)

Dynamic Branch Prediction Summary

Prediction becoming important part of scalar execution
Branch History Table: 2 bits for loop accuracy
Correlation: recently executed branches correlated with next branch
  - Either different branches
  - Or different executions of same branches
Tournament Predictor: more resources to competitive solutions and pick between them
Branch Target Buffer: include branch address & prediction
Return address stack for prediction of indirect jump
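To make the summary's contrast between a plain 2-bit branch history table and correlation concrete, here is a minimal sketch of an (m, n) predictor. The class and helper names, the zero-initialized counters, and the use of low PC bits as the table index are assumptions for illustration; with m = 0 it degenerates to an ordinary n-bit BHT.

```python
class CorrelatingPredictor:
    """Hypothetical (m, n) predictor: m global history bits pick one of
    2**m n-bit saturating counters in each branch entry."""

    def __init__(self, m=2, n=2, addr_bits=4):
        self.m = m
        self.addr_bits = addr_bits
        self.max_ctr = (1 << n) - 1
        self.history = 0  # m-bit global shift register
        self.table = [[0] * (1 << m) for _ in range(1 << addr_bits)]

    def _entry(self, pc):
        # Assumption: word-aligned PCs, low address bits select the entry.
        return (pc >> 2) & ((1 << self.addr_bits) - 1)

    def predict(self, pc):
        counter = self.table[self._entry(pc)][self.history]
        return counter > self.max_ctr // 2  # upper half means "taken"

    def update(self, pc, taken):
        row = self.table[self._entry(pc)]
        counter = row[self.history]
        if taken:
            row[self.history] = min(self.max_ctr, counter + 1)
        else:
            row[self.history] = max(0, counter - 1)
        # Shift the outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, pc, outcomes):
    """Predict, then train, for each outcome; return the number wrong."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


# A branch that alternates taken / not taken.
pattern = [k % 2 == 0 for k in range(40)]

correlating = CorrelatingPredictor(m=2, n=2)
mispredictions(correlating, 0x40, pattern[:20])              # warm-up
late_correlating = mispredictions(correlating, 0x40, pattern[20:])

plain = CorrelatingPredictor(m=0, n=2)                       # plain 2-bit BHT
mispredictions(plain, 0x40, pattern[:20])                    # warm-up
late_plain = mispredictions(plain, 0x40, pattern[20:])
```

In this sketch the (2,2) configuration learns the alternation because the history value distinguishes the two cases, while the single counter per branch in the m = 0 configuration keeps oscillating between two not-taken states and misses every taken instance.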