Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
Martin Hirzel (hirzel@colorado.edu), Trishul Chilimbi (trishulc@microsoft.com)
FDDO4, December 2001, Austin, Texas
“Low-overhead temporal profiling”
• Low overhead
  – Intended for dynamic optimization systems
  – Profile overhead must be recovered by optimization
• Temporal profiling
  – Trend in profiling literature: discover more causality (path profiling, calling context trees, etc.)
  – Temporal profiles expose more optimization opportunities
Arnold-Ryder profiling framework
[Figure: (a) original procedure with blocks A and B; (b) modified procedure (Arnold-Ryder) with checking code, instrumented code (A', B'), an entry check, and a back-edge check]
• Counter $\mathit{nCheck}$
• Sampling rate $r = \frac{1}{\mathit{nCheck}_0 + 1}$
• Implemented in Jikes RVM (Java on PowerPC)
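The counter mechanism can be illustrated with a small simulation. This is a minimal sketch, not the Jikes RVM implementation: the function name `arnold_ryder_samples` is hypothetical, and the exact point at which the counter is decremented and reset is an assumption, chosen so that the observed rate matches $r = 1/(\mathit{nCheck}_0 + 1)$.

```python
def arnold_ryder_samples(num_checks, n_check0):
    """Return the indices of the dynamic checks that branch to instrumented code.

    Assumed semantics (hypothetical): each check decrements the counter;
    a check that finds the counter at zero takes one sample and resets it.
    """
    n_check = n_check0
    sampled = []
    for i in range(num_checks):
        if n_check == 0:
            sampled.append(i)      # run one instrumented region
            n_check = n_check0     # then fall back to checking code
        else:
            n_check -= 1           # stay in checking code
    return sampled


samples = arnold_ryder_samples(num_checks=1000, n_check0=9)
rate = len(samples) / 1000
print(samples[:3], rate)
```

With `n_check0=9`, one check in every ten is sampled, so the printed rate is 0.1.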
Why longer bursts
• Arnold-Ryder framework isolates events by loop back-edges, calls, and returns
• Example:
    for (i = 1; i < n; i++)
      if (...) f(); else g();
• Temporal relationships interesting for optimization:
  – Single-entry multiple-exit regions
  – Field reordering
Contributions
• Longer bursts
  – Our framework captures temporal relationships across loop back-edges, calls, and returns.
• x86 binaries
  – We report experiences with the framework in an alternative setting with different advantages and disadvantages.
• Overhead reduction techniques
  – We eliminate some of the checks at procedure entries and at loop back-edges.
Talk outline
• Introduction
• Methodology
  – Longer bursts
  – Overhead reduction by eliminating checks
• Evaluation
  – Overhead
  – Profile quality
• Conclusion
Longer bursts
[Figure: (a) original procedure with blocks A and B; (b) modified procedure (longer bursts) with checking code, instrumented code (A', B'), an entry check, and a back-edge check]
• Counters $\mathit{nCheck}$ and $\mathit{nInstr}$
• Sampling rate $r = \frac{\mathit{nInstr}_0}{\mathit{nCheck}_0 + \mathit{nInstr}_0}$
• Implemented using Vulcan (x86 binaries)
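A two-counter scheme of this kind might look like the following simulation. It is a sketch under stated assumptions rather than the Vulcan-based implementation: `bursty_modes` is a hypothetical name, and the way the two counters hand over to each other is an assumption, chosen so that the fraction of checks that continue in instrumented code equals $\mathit{nInstr}_0 / (\mathit{nCheck}_0 + \mathit{nInstr}_0)$.

```python
def bursty_modes(num_checks, n_check0, n_instr0):
    """Return one boolean per dynamic check: True if execution continues
    in instrumented code after that check, False if in checking code.

    Assumed semantics (hypothetical): n_check0 checks pass in checking
    code, then a burst of n_instr0 checks runs in instrumented code.
    """
    if n_instr0 < 1:
        raise ValueError("n_instr0 must be at least 1")
    n_check, n_instr = n_check0, 0
    modes = []
    for _ in range(num_checks):
        if n_check > 0:
            n_check -= 1               # still in checking code
            modes.append(False)
        else:
            if n_instr == 0:
                n_instr = n_instr0     # a new burst starts here
            n_instr -= 1
            modes.append(True)
            if n_instr == 0:
                n_check = n_check0     # burst over, back to checking code
    return modes


modes = bursty_modes(num_checks=1000, n_check0=8, n_instr0=2)
print(sum(modes) / len(modes))
```

With `n_instr0=1` the burst shrinks to a single check interval, which corresponds to the Arnold-Ryder sampling rate $1/(\mathit{nCheck}_0 + 1)$.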
Fewer checks
• Goal: reduce overhead
• Starting point: 6-35% overhead in our setting with checks on all procedure entries and loop back-edges
• Constraint: never recurse or loop for unbounded amount of time without check
• Remark: analogous to thread-yield points, gc-safe points, asynchronous-exception points
Eliminating entry checks
[Figure: call graph with procedures main, check, substitute, match, insert_after, expand, join, ~symbols, and delete_digram]
Eliminating entry checks
[Figure: the same call graph (main, check, substitute, match, insert_after, expand, join, ~symbols, delete_digram) with numeric node labels]
$C = \{\, f \in N \mid \neg\,\mathit{is\_leaf}(f) \wedge (\mathit{is\_root}(f) \vee \mathit{addr\_taken}(f) \vee \mathit{recursion\_from\_below}(f)) \,\}$
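The set $C$ of procedures that keep their entry check can be computed from a call graph along the following lines. This is a sketch, not the authors' implementation: the call graph below is invented for illustration, `entry_check_set` is a hypothetical name, and $\mathit{recursion\_from\_below}(f)$ is interpreted here as "f is the target of a back edge in a depth-first traversal of the call graph", which is an assumption about the slide's predicate.

```python
def entry_check_set(calls, addr_taken=frozenset()):
    """calls maps each procedure name to the list of procedures it calls."""
    procs = set(calls)
    for callees in calls.values():
        procs.update(callees)
    callees_of = {f: list(calls.get(f, [])) for f in procs}
    called = {g for gs in callees_of.values() for g in gs}
    roots = procs - called                 # is_root: no static callers

    rec_from_below = set()                 # targets of DFS back edges (assumed meaning)
    on_stack, done = set(), set()

    def dfs(f):
        on_stack.add(f)
        for g in callees_of[f]:
            if g in on_stack:
                rec_from_below.add(g)      # call closes a cycle at g
            elif g not in done:
                dfs(g)
        on_stack.discard(f)
        done.add(f)

    # Start from the roots, then cover procedures reachable only through cycles.
    for f in sorted(roots) + sorted(procs - roots):
        if f not in done:
            dfs(f)

    return {f for f in procs
            if callees_of[f]               # not is_leaf
            and (f in roots or f in addr_taken or f in rec_from_below)}


# Hypothetical call graph: walk and visit are mutually recursive,
# cmp has its address taken, log is a leaf.
calls = {
    "main": ["parse", "walk"],
    "parse": ["log", "cmp"],
    "walk": ["visit"],
    "visit": ["walk", "log"],
    "cmp": ["log"],
    "log": [],
}
checked = entry_check_set(calls, addr_taken={"cmp"})
print(sorted(checked))
```

Under this interpretation every call cycle contains at least one procedure in $C$, so recursion cannot continue indefinitely without reaching a check.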
Eliminating loop back-edge checks
• Tight inner loops
  – Checking gets expensive relative to time spent in original code
  – Statically optimized, not much opportunity for dynamic optimization
• Omit both checking and profiling for tight inner loops
• k-boring loop:
  – No calls
  – At most k profiling events of interest
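The k-boring test itself is a simple predicate over a loop body. The sketch below assumes a hypothetical representation in which a loop body is a list of opcode strings and in which loads are the profiling events of interest; neither assumption comes from the slides.

```python
def is_k_boring(body, k, is_event=lambda op: op == "load"):
    """Return True if the loop body has no calls and at most k profiling events.

    body is a list of opcode strings (hypothetical representation);
    is_event decides which opcodes count as profiling events of interest.
    """
    if any(op == "call" for op in body):
        return False
    return sum(1 for op in body if is_event(op)) <= k


tight_loop = ["load", "add", "load", "store", "cmp", "branch"]
calling_loop = ["load", "call", "cmp", "branch"]

print(is_k_boring(tight_loop, 4))     # two loads, no calls
print(is_k_boring(tight_loop, 1))     # two loads exceed k = 1
print(is_k_boring(calling_loop, 4))   # contains a call
```

A larger k marks more loops as boring, which removes more back-edge checks but also drops more events from the profile.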
Evaluation: Overhead
• $\mathit{overhead}(r) = \mathit{basic\_overhead} + r \cdot \mathit{instr\_overhead}$
[Figure: bar chart of % basic overhead (axis 0 to 40) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, one bar per configuration]
Configurations:
• orig: all checks intact
• EL: no checks on entry to leaf procedures
• EC: call-graph technique
• EN: no checks on entry to any procedures
• L4: 4-boring loop technique
• L10: 10-boring loop technique
• LN: no checks on any loop back-edges
• EC+L4: call-graph and 4-boring loop techniques
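The overhead model can be worked through numerically. In the sketch below the function names are hypothetical, the sampling rate is taken to be $r = \mathit{nInstr}_0 / (\mathit{nCheck}_0 + \mathit{nInstr}_0)$ as on the longer-bursts slide, and the overhead percentages are made-up inputs for illustration, not measurements from the talk.

```python
def sampling_rate(n_check0, n_instr0):
    """Fraction of checks that continue in instrumented code."""
    return n_instr0 / (n_check0 + n_instr0)


def overhead(r, basic_overhead, instr_overhead):
    """Total overhead in percent: cost of the checks plus the sampled
    share of the instrumentation cost."""
    return basic_overhead + r * instr_overhead


r = sampling_rate(n_check0=1000, n_instr0=50)
# Hypothetical inputs: 5% basic overhead, 200% overhead when always instrumented.
total = overhead(r, basic_overhead=5.0, instr_overhead=200.0)
print(round(r, 4), round(total, 2))
```

Because r is small at settings such as 1000:50, the basic overhead of the checks dominates the total, which is why the evaluation focuses on removing checks.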
Case study: Hot data stream profiles
• data reference: dynamic load, (pc, addr) pair
• data stream: sequence v of data references
• heat of data stream: $v.\mathit{heat} = v.\mathit{length} \cdot v.\mathit{frequency}$
• hot data stream: when v.heat > heat threshold (we set the threshold such that all hot data streams together cover 90% of the profile)
• hot data stream profile: set P of hot data streams and their heats
• $\mathit{overlap}(P, Q) = \sum_{v \in P \cup Q} \min\{v.\mathit{heat}_P,\ v.\mathit{heat}_Q\}$
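The heat and overlap definitions translate directly into code. The sketch below makes two assumptions that the slide does not state: a profile is represented as a dictionary from data stream to heat, and a stream missing from one profile counts as heat 0 there. The streams and heat values are invented for illustration.

```python
def heat(length, frequency):
    """Heat of a data stream: its length times how often it occurs."""
    return length * frequency


def overlap(p, q):
    """Sum, over every stream in either profile, of the smaller of its two heats.

    p and q map a data stream (a tuple of (pc, addr) pairs) to its heat;
    a stream absent from a profile is treated as heat 0 (assumption).
    """
    return sum(min(p.get(v, 0), q.get(v, 0)) for v in set(p) | set(q))


# Hypothetical data streams, each a sequence of (pc, addr) data references.
s1 = ((0x10, 0xA0), (0x14, 0xA8))
s2 = ((0x20, 0xB0), (0x24, 0xB8), (0x28, 0xC0))
s3 = ((0x30, 0xD0), (0x34, 0xD8))

profile_p = {s1: heat(2, 20), s2: heat(3, 10)}
profile_q = {s1: heat(2, 12), s2: heat(3, 12), s3: heat(2, 10)}
print(overlap(profile_p, profile_q))
```

Here s3 contributes nothing because it is hot in only one profile, so the overlap is min(40, 24) + min(30, 36) = 54.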
Evaluation: Overlap
[Figure: % overlap (axis 0 to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim at $\mathit{nCheck}_0 : \mathit{nInstr}_0$ settings 20:1, 100:1, 200:1, 200:10, 1000:10, 2000:10, 1000:50, and 5000:50]
Evaluation: Overlap
• $\mathit{nCheck}_0 : \mathit{nInstr}_0$ = 1000:50
[Figure: bar chart of % overlap (axis 0 to 60) for 181.mcf, 252.eon, 300.twolf, 305.espresso, and boxsim, one bar per configuration]
Configurations:
• orig: all checks intact
• EL: no checks on entry to leaf procedures
• EC: call-graph technique
• EN: no checks on entry to any procedures
• L4: 4-boring loop technique
• L10: 10-boring loop technique
• LN: no checks on any loop back-edges
• EC+L4: call-graph and 4-boring loop techniques
Related work
• Arnold, Ryder, A framework for reducing the cost of instrumented code, PLDI 2001
• Temporal profiling
  – Ball, Larus, Efficient path profiling, MICRO 1996
  – Ammons, Ball, Larus, Exploiting hardware performance counters with flow and context sensitive profiling, PLDI 1997
  – Larus, Whole program paths, PLDI 1999
  – Chilimbi, Efficient representations and abstractions for quantifying and exploiting data reference locality, PLDI 2001
Conclusions
• Bursty tracing can collect temporal profiles online
  – General, low-overhead, deterministic
  – Flexible trade-off between sampling rate, overhead, and burst-length
  – Temporal
• Future work
  – Prefetching hot data streams
  – Eliminating more loop back-edge checks
  – Improving profile quality further