Lecture 9: Branch Prediction I

Prediction analysis, 1-bit predictor, 2-bit predictor, branch history table, branch target buffer

Reducing Branch Penalty

Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mis-predicted branches

Reduce branch penalty:
1. Predict branch/jump instructions AND branch direction (taken or not taken)
2. Predict branch/jump target address (for taken branches)
3. Speculatively execute instructions along the predicted path

Prediction and Prediction Output

- Prediction is made for EVERY instruction
- The only ACCURATE input is the current PC
  - If pre-decoded, inst type is available
- Prediction is made on ALL types of instructions
- Prediction output is the next PC value (which is either current PC + 4 or a branch target)
- Three guesses are made: (1) if the next inst is a branch/jump at all; (2) if the "branch" would be taken; (3) what is the target PC of the "taken branch".

[Figure: the PC feeds the predictors, which output pred_PC; the PC also addresses IM, which returns INST; feedback comes from the "execution" part.]

Mis-Prediction Cases

For predicted taken branches (fetch_pc != pc + 4), mis-predicted if the inst
- is not a branch/jump instruction; or
- target address was predicted wrong; or
- is a branch but not taken

For predicted not taken branches (fetch_pc == pc + 4), mis-predicted if the inst
- is a jump instruction; or
- is a branch instruction, AND the branch is taken

Branch (direction) Prediction

Predict branch direction: taken or not taken (T/NT)

        BNE R1, R2, L1
        …
    L1: …

(Taken: execution continues at L1. Not taken: execution falls through to the next instruction.)

Static prediction: compilers decide the direction
Dynamic prediction: hardware decides the direction using dynamic information
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more …

Mis-prediction Detections and Feedbacks

Detections:
- At commit (most cases)
- At the end of decoding
  - The inst must be non-speculative

Feedbacks:
- From commit stage
- From decoding
- Or from WB if speculative feedback is allowed

[Figure: pipeline stages FETCH (with predictors), RENAME, SCHEDULE, REG, EXE, WB, COMMIT.]
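The rules on the "Mis-Prediction Cases" slide can be written as a single check. The following is a minimal sketch, not part of the lecture: the function name `is_mispredicted`, the `kind` labels ("other", "branch", "jump"), and the fixed 4-byte instruction size are assumptions made for illustration.

```python
def is_mispredicted(pc, fetch_pc, kind, taken=False, target=None):
    """Return True if the predicted next PC (fetch_pc) turned out to be wrong.

    pc       -- address of the instruction being checked
    fetch_pc -- next PC chosen by the predictor
    kind     -- "other", "branch", or "jump" (hypothetical labels)
    taken    -- actual direction, known once the instruction resolves
    target   -- actual target address for a taken branch or a jump
    """
    predicted_taken = fetch_pc != pc + 4  # assumes 4-byte instructions
    if predicted_taken:
        if kind == "other":
            return True            # not a branch/jump instruction at all
        if kind == "branch" and not taken:
            return True            # is a branch, but not taken
        return fetch_pc != target  # taken: wrong only if the target was wrong
    # Predicted not taken (fetch_pc == pc + 4).
    if kind == "jump":
        return True                # a jump always leaves the fall-through path
    return kind == "branch" and taken


# A plain ALU instruction predicted as a taken branch is a mis-prediction.
print(is_mispredicted(0x100, 0x200, "other"))
# A taken branch with the right predicted target is a correct prediction.
print(is_mispredicted(0x100, 0x200, "branch", taken=True, target=0x200))
```

In hardware the `kind == "other"` case can be caught at the end of decoding, while the direction and target cases need the resolved branch outcome.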
Predictor for a Single Branch

General Form
1. Access (address: PC)
2. Predict (output: T/NT; the predictor keeps state)
3. Feedback (T/NT)

1-bit prediction

[State diagram: state 1 predicts taken, state 0 predicts not taken; a T outcome moves to state 1 and an NT outcome moves to state 0.]

Branch History Table of 1-bit Predictor

- BHT also called Branch Prediction Buffer in textbook
- Can use only one 1-bit predictor, but accuracy is low
- BHT: use a table of simple predictors, indexed by bits from PC
- Similar to direct mapped cache
- More entries, more cost, but fewer conflicts, higher accuracy
- BHT can contain complex predictors

[Figure: a K-bit branch address indexes a table of 2^k predictors, which outputs the prediction.]

1-bit BHT Weakness

Example: in a loop, 1-bit BHT will cause 2 mispredictions

Consider a loop of 9 iterations before exit:

    for (…) {
        for (i=0; i<9; i++)
            a[i] = a[i] * 2.0;
    }

- End of loop case, when it exits instead of looping as before
- First time through loop on next time through code, when it predicts exit instead of looping
- Only 80% accuracy even if loop 90% of the time

2-bit Saturating Counter

Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 3.7, p. 249)

[State diagram: states 11 and 10 predict taken; states 01 and 00 predict not taken; arrows labeled T and NT move between the states. Blue: stop, not taken. Gray: go, taken.]

Adds hysteresis to decision making process

Correlating Branch Predictor

Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
- Then behavior of recent branches selects between, say, 2 predictions of next branch, updating just that prediction
- (1,1) predictor: 1-bit global, 1-bit local

[Figure: the branch address (4 bits) selects among 1-bit-per-branch local predictors; the 1-bit global branch history (0 = not taken) selects which prediction is used.]

Correlating Branches

Code example showing the potential:

    if (d==0)
        d=1;
    if (d==1)
        …

Assembly code:

        BNEZ   R1, L1
        DADDIU R1,R0,#1
    L1: DADDIU R3,R1,#-1
        BNEZ   R3, L2
        …
    L2: …

Observation: if BNEZ1 is not taken, then d was 0 and is set to 1, so R3 = 0 and BNEZ2 is not taken either
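The loop example can be checked with a short simulation. This is a sketch under stated assumptions: the branch outcome pattern is 9 taken followed by 1 not taken per pass through the loop, and the 2-bit predictor is a plain saturating counter (increment on taken, decrement on not taken), which may differ in detail from the transitions drawn in the textbook figure. The function name `simulate` is hypothetical.

```python
def simulate(outcomes, bits, state):
    """Run one saturating-counter predictor over a list of branch outcomes.

    Returns (number of mispredictions, final counter state).
    """
    top = (1 << bits) - 1          # largest counter value
    threshold = 1 << (bits - 1)    # predict taken when state >= threshold
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move toward the observed outcome.
        state = min(state + 1, top) if taken else max(state - 1, 0)
    return misses, state


# One pass through the loop: taken 9 times, then not taken at the exit.
loop_pass = [True] * 9 + [False]

# Warm up each predictor with one pass, then measure a second pass.
_, warm_1bit = simulate(loop_pass, bits=1, state=0)
_, warm_2bit = simulate(loop_pass, bits=2, state=0)
misses_1bit, _ = simulate(loop_pass, bits=1, state=warm_1bit)
misses_2bit, _ = simulate(loop_pass, bits=2, state=warm_2bit)

print(f"1-bit: {misses_1bit} mispredictions out of {len(loop_pass)}")
print(f"2-bit: {misses_2bit} mispredictions out of {len(loop_pass)}")
```

In steady state the 1-bit predictor misses both the loop exit and the first iteration of the next pass (80% accuracy), while the 2-bit counter misses only the exit (90% accuracy).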
Correlating Branch Predictor

General form: (m, n) predictor
- m bits for global history, n bits for local history
- Records correlation between m+1 branches
- Simple implementation: global history can be stored in a shift register
- Example: (2,2) predictor, 2-bit global, 2-bit local

[Figure: the branch address (4 bits) selects among 2-bits-per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects which prediction is used.]

Accuracy of Different Schemes (Figure 3.15, p. 206)

[Figure: bar chart of frequency of mispredictions (0% to 20%) for the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, comparing three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]

Branch Target Buffer

Branch Target Buffer (BTB): the address of the branch is used as the index to get the prediction AND the branch target address (if taken)
- Note: must check for branch match now, since can't use wrong branch address

Example: BTB combined with BHT

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB; each entry also holds a Predicted PC and extra prediction state bits. Yes: instruction is a branch, and the predicted PC is used as the next PC. No: branch not predicted, proceed normally (Next PC = PC+4).]

Estimate Branch Penalty

EX: BHT correct rate is 95%, BTB hit rate is 95%
Average miss penalty is 6 cycles
How much is the branch penalty?
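One way to work the closing question is sketched below. The slide does not state how the two rates combine, so this is an assumption: a branch is treated as correctly predicted only when the BTB hits AND the BHT direction is correct, the two events are treated as independent, and every other case pays the full miss penalty. The function name `branch_penalty` is hypothetical.

```python
def branch_penalty(bht_correct_rate, btb_hit_rate, miss_penalty_cycles):
    """Expected penalty per branch, in cycles.

    Assumption: a prediction is correct only if the BTB hits and the BHT
    direction is right, and the two events are independent.
    """
    correct = bht_correct_rate * btb_hit_rate
    return (1.0 - correct) * miss_penalty_cycles


penalty = branch_penalty(0.95, 0.95, 6)
print(f"Estimated penalty: {penalty:.3f} cycles per branch")
```

Under these assumptions the correct-prediction rate is 0.95 × 0.95 = 0.9025, so about 9.75% of branches pay the 6-cycle penalty, giving roughly 0.585 cycles per branch. A model that does not charge a BTB miss on a not-taken branch would give a smaller number.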