
A Framework for Modeling and Optimization of Prescient Instruction Prefetch

A Framework for Modeling and Optimization of Prescient Instruction Prefetch. ACM Sigmetrics 2003, June 12, 2003. Tor M.


  1. A Framework for Modeling and Optimization of Prescient Instruction Prefetch
     ACM Sigmetrics 2003, June 12, 2003
     Tor M. Aamodt †‡, Pedro Marcuello §, Paul Chow ‡, Antonio Gonzalez §, Per Hammarlund ¶, Hong Wang †, John P. Shen †
     † Microprocessor Research, Intel Labs
     ‡ Dept. of Electrical and Computer Engineering, University of Toronto
     § Intel Barcelona Research Center
     ¶ Desktop Products Group, Intel Corp

  2. Multithreading
     [Diagram labels: pc, pc, $]
     Single chip, multiple flows of control.
     Question: How might a single-threaded application exploit this hardware capability?

  3. Helper Threads
     Use spare thread context(s) to reduce µArch bottlenecks. Typically do not need to satisfy all correctness constraints.
     Related work on helper threads:
     - Helper threads: Chappell & Patt, Dubois & Song
     - Slices: Zilles & Sohi, Roth & Sohi
     - Data prefetch: Zilles & Sohi, Collins et al., Annavaram & Davidson, Luk, Moshovos et al., Liao et al.
     - Branch prediction: Chappell & Patt
     - This work: first work to study using helper threads for instruction prefetch (may also help TC pre-building)

  4. Existing/Proposed Techniques
     - Traditional hardware: scalability
     - Helper thread: a few "delinquent" instructions
     - Runahead: need simultaneous I & D miss

  5. Prescient Instruction Prefetch
     [Diagram labels: Main Thread (Prefix, spawn, Infix, target, Postfix), Helper Thread, I-cache Misses]

  6. Optimization of Prescient Instruction Prefetch
     - Optimization problem can be divided into two parts:
       1. Selection of SPAWN-TARGET pairs
       2. Optimization of resulting thread code, and hardware used to run it
     - This paper focuses on the first issue only.
     [Diagram labels: Optimization Algorithms, Path Expression Mappings, Stochastic Path Analysis, HW Abstraction]

  7. HW Abstraction
     [Diagram labels: Memory; Fully Associative I-Cache (line size = 1 inst.); Instruction Sequencer]
     - Intra-procedural control flow = Markov Chain
     - Call/returns paired

  8. HW Abstraction: Prescient Instruction Prefetch
     [Plot: Instructions vs. Time (cycles), marking s, t, and i, with o(s,t) and slack(i,s,t)]

  9. Spawn-Target Selection Tradeoffs
     - S & T highly correlated.
     - S and T should be far apart so slack is larger than memory latency.
     - S->T instruction footprint should fit in I-cache; T->S should not.
     [Diagram: control-flow graph with nodes a, b, X, c and a call to foo(); edge probabilities 0.98, 0.10, 0.999]

  10. Quantifying Tradeoffs
     Metric: aspect quantified
     - Reaching Probability: accuracy
     - Posteriori Probability: coverage
     - Expected Path Length, Path Length Variation: timeliness
     - Path Footprint: timeliness, necessity

  11. Spawn-Target Selection Algorithm
     - Inputs: profile data, estimated CPI
     - Compute metrics / spawn-target value function
     - Select using greedy heuristic
     [Flowchart boxes, as labeled:]
     - I-cache & edge profiling data; estimated helper thread & main thread CPI
     - Partition large basic blocks
     - Summarize procedures
     - Next procedure in bottom-up traversal of call graph
     - Use fast path algorithm to find path expressions
     - Compute RP, PP, path length mean & variance
     - Select next block
     - Select earliest target within ½ max prefetch distance
     - Select set of spawn-points (compute I-cache footprint on-demand)
     - Update estimated # running helper threads, and I-cache miss coverage
     [Flowchart edge labels: Done; Coverage of all basic blocks acceptable or no pair found; No suitable points (twice); Set of spawn points found: OUTPUT spawn pts., target, max. prefetch]

  12. Path Expressions
     - Regular expression describing all paths between two points.
     $P(a, X) = A \cdot (((B \cdot C) \cup (D \cdot E)) \cdot F)^{*} \cdot B$
     [Diagram: control-flow graph from a to X with edges labeled A, B, C, D, E, F]
     Fast Path Expression Algorithm
     - [Tarjan 1981]: general approach to solving path problems efficiently.
     - Examples: solving Ax=b, shortest paths, data flow analysis.

  13. Example: Reaching Probability
     Mappings:
     - concatenation: $[R_1 \cdot R_2] = pq$
     - union: $[R_1 \cup R_2] = p + q$
     - closure: $[R_1^{*}] = \frac{1}{1 - p}$
     Edge probabilities: P(A) = 0.98, P(B) = 0.10, P(C) = 1.00, P(D) = 0.90, P(E) = 1.00, P(F) = 0.999
     $P(a, X) = A \cdot (((B \cdot C) \cup (D \cdot E)) \cdot F)^{*} \cdot B$
     $[P(a, X)] = 0.98 \cdot \left[ \frac{1}{1.0 - (0.1(0.0) + 0.90(1.0)) \cdot (0.999)} \right] \cdot 0.10 \cong 0.97$
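The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not the authors' implementation: the helper names `concat`, `union`, and `closure` are hypothetical, and the probability of edge C is set to 0.0 as in the slide's own arithmetic (which appears to discard paths that have already passed through X, although the slide lists P(C) = 1.00).

```python
# Reaching-probability mappings from the slide (helper names are hypothetical).
def concat(p, q):
    """[R1 . R2] = p * q"""
    return p * q


def union(p, q):
    """[R1 U R2] = p + q"""
    return p + q


def closure(p):
    """[R1*] = 1 / (1 - p)"""
    return 1.0 / (1.0 - p)


# Edge probabilities from the slide; C is 0.0 here, matching the slide's arithmetic.
prob = {"A": 0.98, "B": 0.10, "C": 0.0, "D": 0.90, "E": 1.00, "F": 0.999}

# P(a, X) = A . (((B . C) U (D . E)) . F)* . B
loop_body = concat(
    union(concat(prob["B"], prob["C"]), concat(prob["D"], prob["E"])),
    prob["F"],
)
rp_a_x = concat(concat(prob["A"], closure(loop_body)), prob["B"])

print(round(rp_a_x, 2))  # prints 0.97
```

Evaluating the expression bottom-up like this mirrors how the mappings are applied to each operator of the path expression.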

  14. Mappings
     Here $R_1$ has reaching probability $p$, expected path length $X$, and path length variance $v$; $R_2$ has $q$, $Y$, and $w$.
     - concatenation $[R_1 \cdot R_2]$: reaching probability $pq$; expected path length $X + Y$; path length variance $v + w$
     - union $[R_1 \cup R_2]$: reaching probability $p + q$; expected path length $\frac{pX + qY}{p + q}$; path length variance $\frac{p(v + X^2) + q(w + Y^2)}{p + q} - \left[ \frac{pX + qY}{p + q} \right]^2$
     - closure $[R_1^{*}]$: reaching probability $\frac{1}{1 - p}$; expected path length $\frac{pX}{1 - p}$; path length variance $\frac{p(v + X^2)}{1 - p} + \left[ \frac{pX}{1 - p} \right]^2$
     Decompose problem: $\sum E[X \mid \text{follow } p \in R] \cdot P[\text{follow } p \in R]$
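The three mappings can be applied together by carrying a (probability, mean, variance) triple through each operator. The sketch below is illustrative only: the `PathStats` type and function names are assumptions, not names from the paper, and the formulas are the ones read off this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    """Hypothetical container for the three quantities mapped on the slide."""
    p: float  # reaching probability
    x: float  # expected path length
    v: float  # path length variance


def concat(r1, r2):
    # Probabilities multiply; means and variances add.
    return PathStats(r1.p * r2.p, r1.x + r2.x, r1.v + r2.v)


def union(r1, r2):
    total = r1.p + r2.p
    mean = (r1.p * r1.x + r2.p * r2.x) / total
    # Probability-weighted second moment, then subtract the squared mean.
    second_moment = (r1.p * (r1.v + r1.x ** 2) + r2.p * (r2.v + r2.x ** 2)) / total
    return PathStats(total, mean, second_moment - mean ** 2)


def closure(r):
    mean = r.p * r.x / (1.0 - r.p)
    var = r.p * (r.v + r.x ** 2) / (1.0 - r.p) + mean ** 2
    return PathStats(1.0 / (1.0 - r.p), mean, var)


# Illustrative values (not from the slides): a loop body taken with
# probability 0.5 that is always 4 instructions long.
loop = closure(PathStats(p=0.5, x=4.0, v=0.0))
print(loop)
```

For the illustrative loop above, the closure yields an expected length of 4 and a variance of 32, which agrees with treating the iteration count as geometrically distributed.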

  15. Path Footprint
     $F(x, y) = \frac{1}{RP(x, y)} \sum_{v} size(v) \cdot RP_{\alpha}(x, v \mid \neg y) \cdot RP_{\beta}(v, y)$
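Reading the slide's formula as F(x,y) = (1 / RP(x,y)) · Σ_v size(v) · RP_α(x, v | ¬y) · RP_β(v, y), the footprint is a weighted sum over basic blocks v. The sketch below assumes the three per-block quantities have already been computed elsewhere; the function name, argument layout, and sample numbers are hypothetical.

```python
def path_footprint(rp_xy, blocks):
    """Expected footprint between x and y.

    rp_xy  -- RP(x, y), the reaching probability from x to y
    blocks -- iterable of (size_v, rp_alpha, rp_beta) tuples, where
              rp_alpha stands for RP_alpha(x, v | not y) and
              rp_beta stands for RP_beta(v, y)
    """
    if rp_xy <= 0.0:
        raise ValueError("RP(x, y) must be positive")
    return sum(size * ra * rb for size, ra, rb in blocks) / rp_xy


# Illustrative numbers only (not taken from the slides).
example_blocks = [(4, 1.0, 0.5), (8, 0.5, 0.5)]
print(path_footprint(0.5, example_blocks))  # prints 8.0
```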

  16. Accuracy: vs. Monte Carlo
     [Four scatter plots of Measured vs. Predicted values: Reaching Probability, Expected Path Length, Path Length Variation, Path Footprint]
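A Monte Carlo check of a predicted reaching probability can be sketched by simulating random walks over the control-flow graph. The example below reuses the edge probabilities from the reaching-probability example on slide 13; the graph shape (a enters a loop head via A, the head goes to X via B or around the loop via D, E, and F) is an assumption inferred from the path expression, and the function names are hypothetical.

```python
import random


def reaches_x(rng):
    """Simulate one walk from a; return True if it reaches X."""
    if rng.random() >= 0.98:       # edge A: a -> loop head
        return False
    while True:
        if rng.random() < 0.10:    # edge B: loop head -> X
            return True
        # Otherwise edge D (0.90) followed by edge E (1.00).
        if rng.random() >= 0.999:  # edge F: back to the loop head
            return False


def estimate_reaching_probability(trials, seed=0):
    """Fraction of simulated walks that reach X."""
    rng = random.Random(seed)
    hits = sum(reaches_x(rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Should land close to the predicted value of about 0.97.
    print(estimate_reaching_probability(100_000))
```

Stopping each walk the first time it reaches X is what makes this a reaching probability rather than a visit count.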

  17. Accuracy: vs. Execution
     [Four scatter plots of Measured vs. Predicted values: Reaching Probability, Expected Path Length, Path Length Variation, Path Footprint]

  18. Spawn-Target Selection Algorithm
     [Flowchart boxes, as labeled:]
     - I-cache & edge profiling data; estimated helper thread & main thread CPI
     - Partition large basic blocks
     - Summarize procedures
     - Next procedure in bottom-up traversal of call graph
     - Use fast path algorithm to find path expressions
     - Compute RP, PP, path length mean & variance
     - Select next block
     - Select earliest target within ½ max prefetch distance
     - Select set of spawn-points (compute I-cache footprint on-demand)
     - Update estimated # running helper threads, and I-cache miss coverage
     [Flowchart edge labels: Done; Coverage of all basic blocks acceptable or no pair found; No suitable points (twice); Set of spawn points found: OUTPUT spawn pts., target, max. prefetch]

  19. Selection Algorithm Details
     Loop over basic blocks (ranked by E[# i-misses]):
     1. Select target, then select spawn:
        value(spawn) ∝ PP · RP · E[postfix size] · P[miss] · P[!evicted] · P[# ht < # ctx]
     2. Update coverage metrics:
        - # helper threads = ∑ PP(s,i) · P[still running]
        - # i-cache misses -= PP(t,s) · PP(t,i) · P[still running] · (# i-cache misses)
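The value function in step 1 is a product of six factors, so ranking candidate spawn points reduces to computing that product and taking the maximum. The following is a rough sketch under stated assumptions: the `SpawnCandidate` type, its field names, and the sample numbers are hypothetical, and since the slide gives the value only up to proportionality, only the relative ordering is meaningful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnCandidate:
    """Hypothetical record of the factors in value(spawn) for one spawn point."""
    name: str
    pp: float             # posteriori probability
    rp: float             # reaching probability
    postfix_size: float   # E[postfix size]
    p_miss: float         # P[miss]
    p_not_evicted: float  # P[!evicted]
    p_ctx_free: float     # P[# ht < # ctx]


def spawn_value(c):
    # Product of the six factors listed on the slide.
    return (c.pp * c.rp * c.postfix_size
            * c.p_miss * c.p_not_evicted * c.p_ctx_free)


def best_spawn(candidates):
    # Greedy choice: take the candidate with the highest value.
    return max(candidates, key=spawn_value)


# Illustrative numbers only (not taken from the slides).
candidates = [
    SpawnCandidate("s1", 0.9, 0.97, 100.0, 0.5, 0.8, 0.9),
    SpawnCandidate("s2", 0.5, 0.99, 200.0, 0.5, 0.9, 0.9),
]
print(best_spawn(candidates).name)  # prints s2
```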
