Spring 2016 :: CSE 502 – Computer Architecture Pipeline Front-End Instruction Fetch & Branch Prediction Nima Honarmand

Big Picture

Fetch Rate is an ILP Upper Bound
• Instruction fetch limits performance
  – To sustain an IPC of N, must sustain a fetch rate of N instructions per cycle
  – Need to fetch N on average, not on every cycle
• An N-wide superscalar ideally fetches N instructions per cycle
• This doesn’t happen in practice due to:
  – Instruction cache organization
  – Branches
  – The interaction between the two

Instruction Cache Organization
• To fetch N instructions per cycle...
  – I$ line must be wide enough for N instructions
• PC register selects I$ line
• A fetch group is the set of instructions to be fetched
  – For an N-wide machine, [PC, PC+N-1]
[Figure: the PC drives a decoder that selects one I$ line; each line holds a tag and four instructions]

Fetch Misalignment
• If PC = xxx01001, N=4:
  – Ideal fetch group is xxx01001 through xxx01100 (inclusive)
• Misalignment reduces fetch width
[Figure: I$ with four-instruction lines (rows 000 to 111, columns 00 to 11); PC xxx01001 selects row 010 at column 01, so the fetch group extends past the end of the line]
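The loss in fetch width can be worked out with a few lines of arithmetic. This is a minimal sketch, assuming the PC counts instructions (not bytes) and that only one I$ line can be read per cycle; the function name and parameters are illustrative, not from the slides.

```python
def insts_fetched_this_cycle(pc: int, width: int, line_insts: int) -> int:
    """Instructions delivered when a single I$ line is read per cycle.

    pc         -- fetch PC in units of instructions (assumption)
    width      -- N, the machine's fetch width
    line_insts -- instructions per I$ line
    """
    offset = pc % line_insts  # position of the PC within its line
    # Only the instructions from the PC to the end of the line are usable.
    return min(width, line_insts - offset)


# The slide's example: low PC bits 01001, N = 4, four instructions per line.
example = insts_fetched_this_cycle(0b01001, width=4, line_insts=4)
```

With the slide's numbers, `example` is 3: the fourth instruction of the ideal fetch group sits in the next line.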

Reducing Fetch Misalignment
• Fetch block A and A+1 in parallel
  – Banked I$ + rotator network
    • To put instructions back in correct order
  – May add latency (add pipeline stages to avoid slowing down clock)
• There are other solutions using advanced data-array SRAM design techniques…
[Figure: Bank 0 holds even sets and Bank 1 holds odd sets; instructions 1020 through 1023 pass through a rotator to form the aligned fetch group]
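The banked fetch plus rotator can be modeled in software to see why the rotation is needed. This is a sketch under stated assumptions: the I$ is a list of lines with an even number of lines, each line holds `LINE` instructions, the PC counts instructions, and bank selection is by line-index parity. All names are hypothetical.

```python
LINE = 4  # instructions per I$ line (assumed)


def banked_fetch(icache, pc, width=LINE):
    """Read lines A and A+1 from two banks and rotate them into program order.

    icache -- list of lines; len(icache) must be even so A and A+1
              always fall in different banks
    pc     -- fetch PC in units of instructions
    """
    line_a = pc // LINE
    line_b = (line_a + 1) % len(icache)
    offset = pc % LINE

    # Bank 0 serves even lines, bank 1 serves odd lines.
    banks = {line_a % 2: icache[line_a], line_b % 2: icache[line_b]}
    raw = banks[0] + banks[1]  # physical output order: bank 0, then bank 1

    # Rotator: start at the PC's slot, which is in the second half of the
    # raw output when line A came from bank 1.
    start = offset + (LINE if line_a % 2 == 1 else 0)
    rotated = raw[start:] + raw[:start]
    return rotated[:width]


# Toy cache where each "instruction" is just its own address.
toy_icache = [[LINE * i + j for j in range(LINE)] for i in range(8)]
```

For example, `banked_fetch(toy_icache, 14)` starts in an odd line (bank 1) and continues in an even line (bank 0), so the raw bank output is out of order until the rotator fixes it.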

Program Control Flow and Branches
• Program control flow is dynamic traversal of the static control flow graph (CFG)
• CFG is mapped to linear memory
[Figure: a CFG of basic blocks connected by branches, next to the linearly-mapped CFG]

Types of Branches
• Direction-wise:
  – Conditional
    • Conditional branches
    • Can use condition code (CC) register or general purpose register
  – Unconditional
    • Jumps, subroutine calls, returns
• Target-wise:
  – PC-encoded
    • PC-relative
    • Absolute addr
  – Computed (target derived from register or stack)
• Need direction and target to find next fetch group

What’s Bad About Branches?
1. Cause fragmentation of I$ lines
[Figure: I$ lines of four instructions; one line contains a branch, and two instruction slots are marked X]
2. Cause disruption of sequential control flow
  – Need to determine direction and target before fetching next fetch group

Branches Disrupt Sequential Control Flow
• Need to determine target → Target prediction
• Need to determine direction → Direction prediction
[Figure: pipeline diagram with Fetch, Instruction/Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute (Branch), Finish, Reorder/Completion Buffer, Complete, Store Buffer, Retire]

Branch Prediction
• Why?
  – To avoid stalls in fetch stage (due to both unknown direction and target)
• Static prediction
  – Always predict not-taken (pipelines do this naturally)
  – Based on branch offset (predict backward branch taken)
  – Use compiler hints
  – These are all direction prediction, what about target?
• Dynamic prediction
  – Uses special hardware (our focus)
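The offset-based static scheme fits in one comparison. A minimal sketch, assuming the target of a PC-relative branch is already known; the function name is illustrative.

```python
def predict_taken_by_offset(branch_pc: int, target_pc: int) -> bool:
    """Static direction prediction from the branch offset.

    Backward branches (target below the branch) are predicted taken,
    since they typically close loops; forward branches are predicted
    not-taken.
    """
    return target_pc < branch_pc


# A loop-closing branch at 0x1040 jumping back to 0x1000 (made-up addresses).
loop_branch_taken = predict_taken_by_offset(0x1040, 0x1000)
```

This only yields a direction; it says nothing about where a computed branch will go.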

Dynamic Branch Prediction
• A form of speculation
  – Integrated with Fetch stage
• Involves three mechanisms:
  – Prediction
  – Validation and training of the predictors
  – Misprediction recovery
• Prediction uses two hardware predictors
  – Direction predictor guesses if branch is taken or not-taken
    • Applies to conditional branches only
  – Target predictor guesses the destination PC
    • Applies to all control transfers
[Figure: pipeline blocks labeled B, F, D, S, C, R, P with I$, D$, regfile, and reorder buffer (ROB)]

BP in Superscalars
• Fetch group might contain multiple branches
• How many branches to predict?
  – Simple: up to the first one (now)
  – A bit harder: up to the first taken one (maybe later)
  – Even harder: multiple taken branches (maybe later)
    • Only useful if you can fetch multiple fetch groups from I$ in each cycle
• How to identify the branch and its target in Fetch stage?
  – I.e., without executing or decoding?
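The "up to the first taken one" policy can be illustrated by truncating a fetch group at the first branch that is predicted taken. This is a sketch only: the fetch group is modeled as a list of `(name, predicted_taken)` pairs, which is an assumed representation, and non-branches simply carry `False`.

```python
def keep_through_first_taken(fetch_group):
    """Return the instructions that remain useful in this fetch group.

    fetch_group -- list of (name, predicted_taken) pairs in program order
    Everything after the first predicted-taken branch is discarded,
    because the next useful instruction is at the branch target.
    """
    kept = []
    for name, predicted_taken in fetch_group:
        kept.append(name)
        if predicted_taken:
            break
    return kept


group = [("add", False), ("beq", False), ("bne", True), ("sub", False)]
useful = keep_through_first_taken(group)
```

Here the not-taken `beq` does not end the group, but the taken `bne` does, so `sub` is dropped.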

Option 1: Partial Decoding
• Huge latency (reduces clock frequency)
[Figure: Fetch PC reads L1-I; four partial decoders (PD) identify the branch’s PC, which feeds Target Pred and Dir Pred; a separate path adds sizeof(inst)]

Option 2: Predecoding
• Predecode branches on fill from L2
• Store 1 bit per inst, set if inst is a branch
  – Partial-decode logic removed
• High latency (L1-I on the critical path)
[Figure: L1-I output supplies the branch’s PC to Target Pred and Dir Pred; a separate path adds sizeof(inst)]

Option 3: Using Fetch Group Addr
• With one branch in fetch group, does it matter where it is?
• Fetch-group addr is stable
  – i.e., the same set of instructions are likely to be fetched using the same fetch group in the future
  – Why?
• Latency determined by branch predictor
[Figure: the cache line address feeds L1-I, Target Pred, and Dir Pred in parallel; a separate path adds sizeof(fetch group) if no branch]

Target Prediction

Target Prediction
• Target: 32- or 64-bit value
• Turns out targets are generally easier to predict
  – Don’t need to predict not-taken target
  – Taken target doesn’t usually change
• Only need to predict taken-branch targets
• Predictor is really just a “cache”
  – Called Branch Target Buffer (BTB)
[Figure: PC feeds Target Pred and a + sizeof(inst) adder]

Branch Target Buffer (BTB)
[Figure: the branch PC (branch instruction or fetch group address) is compared with the stored Branch Instruction Address (BIA); on a match with the valid bit (V) set, the BTB signals a hit and the Branch Target Address (BTA) becomes the next fetch PC]
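A direct-mapped BTB behaves like a small tagged cache keyed by the branch PC. The sketch below assumes 4-byte instructions and indexing by the low bits of the word address; the class, its size, and the method names are hypothetical, and an invalid entry is modeled as `None` instead of a separate valid bit.

```python
class DirectMappedBTB:
    """Toy BTB: each entry is None (invalid) or a (tag, target) pair."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index_and_tag(self, pc):
        word = pc >> 2  # drop byte offset (assumes 4-byte instructions)
        return word % self.num_entries, word // self.num_entries

    def lookup(self, pc):
        """Return the predicted target on a hit, or None on a miss."""
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc, target):
        """Train the BTB once a taken branch's real target is known."""
        index, tag = self._index_and_tag(pc)
        self.entries[index] = (tag, target)


def next_fetch_pc(btb, pc, inst_size=4):
    """Use the BTB target on a hit, else fall through sequentially."""
    target = btb.lookup(pc)
    return target if target is not None else pc + inst_size
```

Before training, `next_fetch_pc` simply returns the fall-through address; after `update(pc, target)` the same PC redirects fetch to `target`.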

Set-Associative BTB
[Figure: the PC selects a set; each way holds V, tag, and target; one comparator per way checks the tag, and the matching way supplies the next PC]
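The set-associative variant lets several branches that map to the same set coexist. This is a sketch with an assumed least-recently-updated replacement policy (the slides do not name one) and assumed 4-byte instructions; all names are illustrative.

```python
class SetAssocBTB:
    """Toy set-associative BTB; each set is a list of (tag, target) pairs."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # Within a set, the most recently updated entry is last.
        self.sets = [[] for _ in range(num_sets)]

    def _set_and_tag(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return word % self.num_sets, word // self.num_sets

    def lookup(self, pc):
        """Compare the tag against every way in the selected set."""
        set_index, tag = self._set_and_tag(pc)
        for entry_tag, target in self.sets[set_index]:
            if entry_tag == tag:
                return target
        return None

    def update(self, pc, target):
        """Insert or refresh an entry, evicting the oldest if the set is full."""
        set_index, tag = self._set_and_tag(pc)
        remaining = [e for e in self.sets[set_index] if e[0] != tag]
        remaining.append((tag, target))
        self.sets[set_index] = remaining[-self.ways:]
```

With two ways, two branches that share a set can both hit; a third one evicts the entry that was updated longest ago.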

Making BTBs Cheaper
• Branch prediction is permitted to be wrong
  – Processor must have ways to detect mispredictions
  – Correctness of execution is always preserved
  – Performance may be affected
• Can tune BTB accuracy based on cost

BTB w/Partial Tags
Full tags (V, tag, target, PC that hits):
  v  00000000cfff981  00000000cfff9704  00000000cfff9810
  v  00000000cfff982  00000000cfff9830  00000000cfff9824
  v  00000000cfff984  00000000cfff9900  00000000cfff984c
Partial tags (V, tag, target, PC that hits):
  v  f981  00000000cfff9704  00000000cfff9810
  v  f982  00000000cfff9830  00000000cfff9824
  v  f984  00000000cfff9900  00000000cfff984c
PC 00001111beef9810 also matches partial tag f981
• Fewer bits to compare, but prediction may alias
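The aliasing in the slide's example can be checked directly. A minimal sketch, assuming the tag is the PC with its low 4 bits dropped and the partial tag keeps only the low 16 bits of that value, which matches the hex values shown; the function names are illustrative.

```python
def full_tag(pc: int) -> int:
    """Full tag: the PC with its low 4 bits dropped."""
    return pc >> 4


def partial_tag(pc: int, tag_bits: int = 16) -> int:
    """Partial tag: only the low tag_bits bits of the full tag."""
    return full_tag(pc) & ((1 << tag_bits) - 1)


trained_pc = 0x00000000CFFF9810   # PC that installed the BTB entry
aliasing_pc = 0x00001111BEEF9810  # unrelated PC from the slide

same_partial = partial_tag(trained_pc) == partial_tag(aliasing_pc)
same_full = full_tag(trained_pc) == full_tag(aliasing_pc)
```

`same_partial` is true while `same_full` is false, so a partial-tag BTB would hand the second PC a target that belongs to a different branch.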

BTB w/PC-offset Encoding
Full targets (V, partial tag, target):
  v  f981  00000000cfff9704
  v  f982  00000000cfff9830
  v  f984  00000000cfff9900
PC-offset encoding (V, partial tag, low bits of target):
  v  f981  ff9704
  v  f982  ff9830
  v  f984  ff9900
A lookup with PC 00000000cfff984c hits tag f984; the predicted target is the upper bits of the PC joined with the stored bits: 00000000cf | ff9900
• If target too far or PC rolls over, will mispredict
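The encoding and its failure mode can be sketched as bit masking. This assumes 24 stored target bits, as in the slide's six hex digits; the constant and function names are illustrative.

```python
STORED_BITS = 24                      # low target bits kept in the BTB (assumed)
LOW_MASK = (1 << STORED_BITS) - 1


def encode_target(target: int) -> int:
    """Keep only the low bits of the target."""
    return target & LOW_MASK


def decode_target(pc: int, stored: int) -> int:
    """Rebuild a full target from the PC's upper bits and the stored low bits."""
    return (pc & ~LOW_MASK) | stored


pc = 0x00000000CFFF984C
near_target = 0x00000000CFFF9900      # shares upper bits with the PC
far_target = 0x00000012340F9900       # upper bits differ from the PC (made up)

near_prediction = decode_target(pc, encode_target(near_target))
far_prediction = decode_target(pc, encode_target(far_target))
```

`near_prediction` reproduces the real target, while `far_prediction` does not equal `far_target`, which is the "target too far" misprediction the slide warns about.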

BTB Miss?
• Dir-Pred says “taken”
• Target-Pred (BTB) misses
  – Could default to fall-through PC (as if Dir-Pred said N-t)
    • But we know that’s likely to be wrong!
• Stall fetch until target known … when’s that?
  – PC-relative: after decode, we can compute target
  – Indirect: must wait until register read/exec
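The two choices on a BTB miss can be written as a small decision function. This is a sketch only; the return convention (an action string plus a PC) and the `stall_on_miss` switch are assumptions made for illustration.

```python
def next_pc_decision(pred_taken, btb_target, fallthrough_pc, stall_on_miss=True):
    """Pick the front end's action for the next cycle.

    pred_taken     -- direction predictor output
    btb_target     -- BTB target on a hit, or None on a miss
    fallthrough_pc -- sequential next PC
    Returns ("fetch", pc) or ("stall", None).
    """
    if not pred_taken:
        return ("fetch", fallthrough_pc)
    if btb_target is not None:
        return ("fetch", btb_target)
    # Predicted taken but no target is available.
    if stall_on_miss:
        return ("stall", None)        # wait for decode or execute
    return ("fetch", fallthrough_pc)  # guess fall-through, likely wrong
```

How long the stall lasts depends on the branch type: a PC-relative target is available after decode, an indirect one only after register read or execute.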

Subroutine Calls
  P: 0x1000: (start of printf)
  A: 0xFC34: CALL printf
  B: 0xFD08: CALL printf
  C: 0xFFB0: CALL printf
BTB contents (V, tag, target):
  1  FFB  0x1000
  1  FC3  0x1000
  1  FD0  0x1000
• BTB can easily predict target of calls

Subroutine Returns
  P:  0x1000: ST $RA → [$sp]
      …
      0x1B98: LD $tmp ← [$sp]
      0x1B9C: RETN $tmp
  A:  0xFC34: CALL printf
  A’: 0xFC38: CMP $ret, 0
  B:  0xFD08: CALL printf
  B’: 0xFD0C: CMP $ret, 0
BTB entry for the return (V, tag, target):
  1  1B9  0xFC38
• BTB can’t predict return for multiple call sites

Return Address Stack (RAS)
  A:  0xFC34: CALL printf
  P:  0x1000: ST $RA → [$sp]
      …
      0x1B9C: RETN $tmp
  A’: 0xFC38: CMP $ret, 0
[Figure: the CALL pushes FC38 onto the RAS above an older entry D004; the RETN takes FC38 from the RAS as its predicted target, next to the BTB]
• Keep track of call stack
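A RAS is a push on call and a pop on return. The sketch below uses an unbounded Python list and assumes 4-byte instructions, so it ignores overflow; the class and method names are hypothetical.

```python
class ReturnAddressStack:
    """Toy RAS with unlimited depth (overflow is not modeled here)."""

    def __init__(self):
        self.stack = []

    def on_call(self, call_pc, inst_size=4):
        """Push the address of the instruction after the call."""
        self.stack.append(call_pc + inst_size)

    def on_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0xFC34)               # A: CALL printf
first_return = ras.on_return()    # RETN inside printf
ras.on_call(0xFD08)               # B: CALL printf
second_return = ras.on_return()   # same RETN, different call site
```

The same return instruction gets 0xFC38 the first time and 0xFD0C the second, which a single BTB entry for that return could not provide.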

Return Address Stack Overflow
1. Wrap-around and overwrite
  • Will lead to eventual misprediction after four pops
2. Do not modify RAS
  • Will lead to misprediction on next pop
  • Need to keep track of # of calls that were not pushed
[Figure: a full RAS holding FC90 (top of stack), 421C, 48C8, 7300; 64AC: CALL printf needs to push 64B0, marked ???]
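The wrap-around policy (option 1) can be modeled as a circular buffer. This is a sketch assuming a four-entry RAS; the class name, the made-up return addresses, and the lack of an underflow check are simplifications for illustration.

```python
class CircularRAS:
    """Fixed-size RAS that overwrites the oldest entry on overflow."""

    def __init__(self, size=4):
        self.slots = [None] * size
        self.top = 0  # index of the next slot to write

    def push(self, return_addr):
        self.slots[self.top] = return_addr
        self.top = (self.top + 1) % len(self.slots)

    def pop(self):
        self.top = (self.top - 1) % len(self.slots)
        return self.slots[self.top]


ras = CircularRAS(size=4)
true_returns = [0x100, 0x200, 0x300, 0x400, 0x500]  # five nested calls
for addr in true_returns:
    ras.push(addr)  # the fifth push overwrites 0x100

# Returns happen in reverse order of the calls.
predictions = [ras.pop() for _ in true_returns]
expected = list(reversed(true_returns))
```

The first four entries of `predictions` match `expected`; the fifth does not, because the oldest return address was overwritten. That is the "misprediction after four pops" case.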

Direction Prediction
