A tour of a microprocessor museum Tour t heme: µ -architectural parallelism complicates performance understanding. BAFL: Bottleneck Analysis of Fine-grain Parallelism Tour game: “ Bot t leneck Hunt ” Which instruction slowed down the execution, and by how much? Prof. Rast islav Bodík More specifically, why t he following model fails? execution time = wit h Brian Fields in part wit h Shai Rubin, Prof . Mark Hill, Prof . Mary Vernon cost(instruction 1 ) + … + cost(instruction n ) [cycles] University of Wisconsin The computer system A tour of a microprocessor museum (0) • Many levels of granularity, each with unique no parallelism performance problems Int el 80386 internet WANS • servers • fetch • • decode • execut e microprocessors • Our goal: • quantitative approach for modern (out -of-order) • processors Who cares? A tour of a microprocessor museum (1) • Archit ect s: scalar pipeline parallelism circuit complexity • Int el 80486 power consumption • • Soft ware engineers: performance-critical software • St udent s: • Fetch Decd Read Exe Mem Writ e intuition how processors work • Processors: • understand themselves •
Why crit ical pat h? A tour of a microprocessor museum (2) � Microprocessors are fine-grain parallel systems in-order superscalar pipeline Int el Pent ium like wide-area networks: • queues are like routers, pipelines are like communicat ion links . • many (bad) event s going on in parallel, t heir lat ency t olerat ed Fetch Decd Read Exe Mem Writ e Fetch Decd Read Exe Mem Writ e The Bill Cosby Rule:* “ You’ re not a parent if you only have one child.” *rule named by Amir Roth A tour of a microprocessor museum (3) Outline out-of-order superscalar � The model of micro-execution Int el Pent ium 4 • capt ure bot h program and processor const raint s Four met rics: crit icalit y • slack • • execut ion modes • cost Critical path of a microexecut ion A tour of a microprocessor museum (end) out-of-order superscalar Critical path misconceptions: typical buffers, queues, windows • “ Every ‘ bad event’ is critical.” • branch misprediction • reorder-buffer st all • L1 cache miss • L2 cache miss decode reorder store buffer buffer • “ Critical path is obvious … ” buffer (ROB) … it cont ains inst ruct ions providing dat a for ‘ bad event s’ reservat ion missed st at ions loads � processors are good at tolerating latency, but poor at deciding what to tolerate.
Modeling: why hard? Critical Path Models (3) OOO + finite re -order buffer Crit ical pat h consist s of: 1. inst ruct ions and dat a dependences as in a traditional “ compiler” view Fetch • F F F F F microarchit ect ural resource const raint s 2. Execute • branch mispredictions, finite fetch b/ w, etc. E E E E E Together describe the microexecut ion of a Commit C C C C C given program executing on a given machine ROB Size How t o model in a uniform way? Critical Path Models (1) Critical Path Models (4) OOO + finite ROB + branch misp First, for a simple in-order machine Data dependencies • Resource dependencies • Fetch F F F F F Execute E E E E E Commit Dynamic C C C C C i 1 i 2 i 3 i 4 i 5 Instructions oldest newest mispredict ed branch Resources constrain the dataflow execution Critical Path Models (2) Example first inst ruct ion For an out-of-order machine 0 1 0 1 0 1 0 F Fetch F F F F F in order 4 1 1 1 1 1 1 1 1 2 3 1 1 Execute E E E E E E out of order 1 1 1 3 2 4 2 2 2 2 2 Commit 0 1 0 1 0 1 0 C C C C C C in order 0 0 0 0 0 i 1 i 2 i 3 i 4 i 5 oldest newest last inst ruct ion
Example Execution Modes Three modesof execution fetch limited (F-mode) 0 1 0 1 0 1 0 execute limited (E-mode) F commit limited (C-mode) 4 1 1 1 1 1 1 1 1 2 3 1 1 F F F F F F F F F F F F F E F-mode 1 1 1 3 2 4 2 2 2 2 2 E E E E E E E E E E E E E 0 1 0 1 0 1 0 C E-mode C C C C C C C C C C C C C 0 0 0 0 0 C-mode CP Lengt h = 16 cycles ⇒ Exe Time = 16 cycles Example Execution Modes Ent ering F-mode what if t his load is an L1 miss? S tart of program Branch misp. ROB stall (3 cycles � 12 cycles) F F F F F F F F F F F 0 1 0 1 0 1 0 F E E E ... ... E E E ... ... E E E E E ... 4 1 1 1 1 1 1 1 1 2 3 1 1 C C C C C C C C C C C E 1 st inst . in program 1 1 1 Ent ering E-mode Ent ering C -mode 3 2 4 2 2 2 2 2 Fetch catches up ROB stall 0 1 0 1 0 1 0 C F F F F F F F F 0 ... E E E E E ... ... E E E ... CP Lengt h = 16 cycles ⇒ Exe Time = 16 cycles C C C C C C C C Example Validation: can we trust our model? Execution Time Reduction (in cycles) per Cycle of Latency Reduced what if t his load is an L1 miss? (3 cycles � 12 cycles) 1 0 1 0 1 0 1 0 F 0.9 Reducing CP Latencies 4 0.8 1 1 1 1 1 1 1 1 2 12 1 1 0.7 0.6 E 0.5 1 1 1 0.4 3 2 4 2 2 2 2 2 0.3 0 1 0 1 0 1 0 C Reducing non-CP Latencies 0.2 0.1 0 0 crafty eon gcc gzip parser perl twolf vortex ammp art galgel mesa CP Lengt h = 19 cycles ⇒ Exe Time = 19 cycles
Outline S tep 1. Observing R2 The model of micro-execution R1 R2 + R3 E capt ure bot h program and processor const raint s R3 • Dependence resolved early Four met rics: • criticality If dependence int o R2 is on crit ical pat h, t hen value of R2 arrived last . • slack execut ion modes • critical ⇒ cost arrives last • ⁄ arrives last ⇒ crit ical Why criticality predictor? Policies! Last-arrive edges: a CPU stethoscope F � E E � E C � F mechanism mechanism mechanism mechanism policy: policy: policy: policy: current current current current bet t er bet t er bet t er bet t er E � F E � C F � F OOO execut ion OOO execut ion OOO execut ion OOO execut ion how to schedule? how to schedule? how to schedule? how to schedule? oldest first oldest first oldest first oldest first critical first critical first critical first critical first predict ion and predict ion and predict ion and predict ion and when t o speculat e? when t o speculat e? when t o speculat e? when t o speculat e? on each predict ion on each predict ion on each predict ion on each predict ion only critical only critical only critical only critical speculat ion speculat ion speculat ion speculat ion C � C how to serve mem how to serve mem how to serve mem how to serve mem non- blocking caches non- blocking caches non- blocking caches non- blocking caches FIFO FIFO FIFO FIFO critical first critical first critical first critical first requests? requests? requests? requests? CPU pre -fetch, pre -fetch, pre -fetch, pre -fetch, what t o prefet ch? what t o prefet ch? what t o prefet ch? what t o prefet ch? all misses all misses all misses all misses prefet ch crit ical prefet ch crit ical prefet ch crit ical prefet ch crit ical pre -execut e pre -execut e pre -execut e pre -execut e � Current policies are egalitarian: all “ bad” events equally harmful. Prediction: why hard? Implementing last-arrive edges Observe events within the machine Three st eps: observe the microexecution ⇒ hard! 1. F F F F F F F F • measuring edge latencies is intrusive E E E E E E E E analyze to find critical path ⇒ hard! C C C C C C C C 2. E � F if branch misp. C � F if ROB st all F � F ot herwise • graph too large to buffer and topological sort too complex • F F F F F F F F F store prediction for later use ⇒ easy! 3. E E E E E E E E E • store in table indexed by PC C C C C C C C C C E � E observe F � E if dat a E � C if commit C � C ot herwise arrival order of point er is ready on fet ch operands delayed
Last-arrive edges … and we’ve found the critical path! Backward propagate along last-arrive edges 0 0 1 1 0 1 0 F F 1 1 1 1 1 1 4 1 1 1 2 3 1 E E 1 1 1 3 2 4 2 2 2 2 2 0 1 0 1 0 1 0 C C 0 0 0 0 0 � Found CP by only observing last-arrive edges � but still requires constructing entire graph Remove latencies Prediction: why hard? Three st eps: Do not need explicit weights ⇒ solved! 1. observe the microexecution measuring edge latencies is intrusive • F 2. analyze to find critical path graph too large to buffer ⇒ hard! • and topological sort too complex ⇒ solved! E • store prediction for later use ⇒ easy! 3. C store in table indexed by PC • Prune the graph Step 2. Efficient analysis (predictor training) Only last-arrive edges needed CP is a ” long” chain of last -arrive edges. (other edges must be non-critical) ⇒ t he longer a given chain of last-arrive edges, t he more likely it is part of t he CP F Algorithm: find sufficient ly long last -arrive chains Plant token into a node n 1. E Propagate forward, only along last -arrive edges 2. Check for token after several hundred cycles 3. C If token alive, n is assumed critical 4.
Recommend
More recommend