
Using Sampling to Understand Parallel Program Performance Nathan Tallent John Mellor-Crummey M. Krentel, L. Adhianto, M. Fagan, X. Liu Dept. of Computer Science, Rice University Parallel Tools Workshop TU Dresden Sept 26-27, 2011


  1. Using Sampling to Understand Parallel Program Performance Nathan Tallent John Mellor-Crummey M. Krentel, L. Adhianto, M. Fagan, X. Liu Dept. of Computer Science, Rice University Parallel Tools Workshop • TU Dresden • Sept 26-27, 2011 hpctoolkit.org Monday, September 26, 2011

  2. Advertisement: WHIST 2012 • International Workshop on High-performance Infrastructure for Scalable Tools — in conjunction with PPoPP 2012 — February 25-29, 2012 — New Orleans, LA, USA • All things tools: — special emphasis on performance and scalability — special emphasis on infrastructure • how do we build tools? • often not welcome in other venues • No more CScADS Tools Workshop; come to WHIST instead! whist-workshop.org

  3. For Measurement, Sampling is Obvious... • Instrumentation (synchronous) — monitors every instance of something — must be used with care • instrumenting every procedure ➟ large overheads – may preclude measurement of production runs – encourages selective instrumentation ➟ blind spots • instrumentation disproportionately affects small procedures (systematic error) • cannot measure at the statement level — note: specialized techniques may reduce overheads • sample instrumentation (switch between heavyweight & lightweight instrumentation) • dynamically insert & remove binary instrumentation — instrumentation types: source code, compiler-inserted, static binary, dynamic binary • Sampling (asynchronous) — provides representative and detailed measurements • assumes sufficient samples and no correlations (usually easy to satisfy) • Use sampling when possible; use instrumentation when necessary.

  4. … but: Sampling is Also Questionable • How to attribute metrics to loops? — attribution to outer loops trivial with source-code instrumentation • How to attribute metrics to data objects? • How to collect call paths at arbitrary sample points? — stack unwinding can be hard: try using libunwind or GLIBC backtrace() in a SEGV handler for optimized code • How to collect call paths for languages that don’t use stacks? — perfect stack unwinding not sufficient • Will sampling miss something important? — it is possible to miss the first cause in a chain of events • How to pinpoint lock contention? — with instrumentation, it is easy to track lock acquire/release • Can sampling pinpoint the causes of lock contention?

  5. Outline • Motivation • Call Path Profiling • Pinpointing Scaling Bottlenecks • Blame Shifting — parallel idleness and overhead in work stealing — lock contention — load imbalance • Call Path Tracing • Conclusions

  6. Sampling-based Call Path Profiling is Easy...* • Measure & attribute costs in calling context — use timer or hardware counter as a sampling source — gather calling context using stack unwinding • [Figure: a call path sample (a chain of return addresses ending at the sampled instruction pointer) is inserted into a Calling Context Tree (CCT) rooted at “main”; the leaf marks the sample point] • Overhead proportional to sampling frequency, not call frequency • CCT size scales well
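The CCT on slide 6 can be sketched in a few lines. This is a minimal illustration, not HPCToolkit's data structure: the names `CCTNode`, `record_sample`, and `inclusive` are hypothetical, and frames are plain strings rather than return addresses. Each sample walks (and extends) one root-to-leaf path, so the tree grows with the number of distinct calling contexts, not with the number of calls or samples.

```python
class CCTNode:
    """One calling context: a frame reached through a specific chain of callers."""

    def __init__(self, frame):
        self.frame = frame
        self.samples = 0      # samples whose innermost frame is this context
        self.children = {}    # callee frame -> CCTNode

    def child(self, frame):
        """Return the child node for `frame`, creating it on first use."""
        node = self.children.get(frame)
        if node is None:
            node = self.children[frame] = CCTNode(frame)
        return node


def record_sample(root, call_path):
    """Insert one unwound call path (outermost frame first) into the tree."""
    node = root
    for frame in call_path:
        node = node.child(frame)
    node.samples += 1


def inclusive(node):
    """Samples attributed to this context plus everything it called."""
    return node.samples + sum(inclusive(c) for c in node.children.values())
```

For example, recording `["main", "solve", "dot"]` twice and `["main", "solve"]` once yields an inclusive count of 3 for the `solve` context and an exclusive count of 1.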

  7. … but: Hard to Unwind From Async Sample • Asynchronous sampling: — ✓ overhead proportional to sampling frequency • low, controllable • no sample => irrelevant to performance — ✓ overhead effects are not systematic — ✗ must be able to unwind from every point in an executable • (use call stack unwinding to gather calling context) • Why is unwinding difficult? Requires either: — frame pointers (linking every frame on the stack) • omitted in fully optimized code — complete (and correct) unwind information • omitted for epilogues • erroneous (optimizers fail to maintain info while applying transformations) • missing for hand-coded assembly or partially stripped libraries (programs often spend significant time in such routines!) • Solution: dynamically analyze binary to compute unwind information

  8. Unwinding Fully Optimized Parallel Code • Identify procedure bounds — for dynamically linked code, do at runtime — for statically linked code, do at compile time • Compute (on demand) unwind recipes for a procedure: — scan the procedure’s object code, tracking the locations of • caller’s program counter • caller’s frame and stack pointer — create unwind recipes between pairs of frame-relevant instructions • Processors: x86-64, x86, Power/BGP, MIPS (SiCortex ☹) • Results — accurate call path profiles — overheads of < 2% for sampling frequencies of 200/s • PLDI 2009. Distinguished Paper Award.
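The recipe idea on slide 8 can be illustrated with a toy model. Everything here is an assumption made for illustration: the instruction encoding (`push`, `pop`, `sub_sp`, `add_sp`), the 8-byte word size, and the restriction to straight-line code. A real binary analyzer must also handle branches, frame pointers, and real machine encodings. The sketch scans a procedure once and records, for each interval between frame-relevant instructions, how far the return address sits above the stack pointer.

```python
WORD = 8  # assumed stack slot size in bytes (x86-64-like)

# How each hypothetical frame-relevant instruction moves the stack pointer
# away from the return address.
_SP_EFFECT = {
    "push": lambda arg: WORD,
    "pop": lambda arg: -WORD,
    "sub_sp": lambda arg: arg,
    "add_sp": lambda arg: -arg,
}


def compute_unwind_recipes(instructions):
    """Return [(first_index, last_index, ra_offset)] intervals.

    ra_offset is the distance from the stack pointer to the return address
    that holds *before* any instruction in the interval executes.
    """
    intervals = []
    offset = 0   # at procedure entry the return address is at [SP + 0]
    start = 0
    for i, (op, arg) in enumerate(instructions):
        if op in _SP_EFFECT:
            # A frame-relevant instruction ends the current interval.
            intervals.append((start, i, offset))
            offset += _SP_EFFECT[op](arg)
            start = i + 1
    if start < len(instructions):
        intervals.append((start, len(instructions) - 1, offset))
    return intervals


def return_address_offset(intervals, pc_index):
    """Look up the recipe that applies at the sampled instruction index."""
    for first, last, offset in intervals:
        if first <= pc_index <= last:
            return offset
    raise ValueError("pc_index is outside the procedure")
```

A sample that lands in the procedure body uses the body's recipe, while a sample in the prologue or epilogue uses the narrower interval that covers it, which is the case that plain frame-pointer or compiler-supplied information often gets wrong.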

  9. Attributing Measurements to Source • Compilers create semantic gap between binary and source — call paths differ from user-level • inlining, tail calls — loops differ from user-level • software pipelining, unroll & jam, blocking, etc. • Must bridge this semantic gap — to be useful, a tool should • attribute binary-level measurements to source-level abstractions • How? — cannot use instrumentation to measure user-level loops • Solution: statically analyze binary to compute an object-to-source-code mapping

  10. Attribution to Static & Dynamic Context • costs for — inlined procedures — loops — function calls in full context • 1-2% overhead • PLDI 2009. Distinguished Paper Award.

  11. Outline • Motivation • Call Path Profiling • Pinpointing Scaling Bottlenecks • Blame Shifting — parallel idleness and overhead in work stealing — lock contention — load imbalance • Call Path Tracing • Conclusions

  12. The Challenges of Scale: O(100K) cores • What if my app doesn’t scale? Where is the bottleneck? • First step: Performance tools must scale! — measure at scale • sample processes (SPMD apps): simple and effective – record data on a process with probability p – simplification of Gamblin et al., IPDPS ’08 — analyze measurements • analyze measurements in parallel • data structure has space requirements of O(1 call path tree) — present performance data (without requiring an Altix) • use above data structure • Next step: Deliver insight — identify scaling bottlenecks in context… • use detailed sampling-based call path profile
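The process-sampling step on slide 12 ("record data on a process with probability p") might look like the sketch below. The function name `should_record`, the `seed` parameter, and the choice to seed a per-rank generator are assumptions for illustration; the slides do not say how the decision is made. Seeding by rank lets every process decide locally, without communication, and makes the choice reproducible across runs.

```python
import random


def should_record(rank, p, seed=2011):
    """Decide locally whether process `rank` records measurement data.

    Each process makes an independent decision that is true with
    probability p, so about p * nprocs processes write profiles.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = random.Random(f"{seed}:{rank}")  # deterministic per-rank stream
    return rng.random() < p
```

In an SPMD job each rank would call `should_record(my_rank, p)` once at startup and skip writing its profile when the result is false.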

  13. Pinpointing & Quantifying Scalability Bottlenecks • [Figure: scaling loss obtained by differencing calling context trees: Q × (average CCT on Q cores, 600 s) − P × (average CCT on P cores, 400 s) = 200 s] • Weak scaling: no coefficients • Strong scaling: red coefficients (P and Q) • Coarfa, Mellor-Crummey, Froyd, Dotsenko. ICS’07 • SC 2009
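The differencing on slide 13 can be sketched as follows, under two assumptions that the garbled figure does not fully confirm: that the loss is the Q-core cost minus the P-core cost, and that strong scaling multiplies each side by its core count. The function name `scaling_loss` and the flat representation (a dict from call path tuple to average cost in seconds) are hypothetical simplifications of a real CCT.

```python
def scaling_loss(cct_p, cct_q, p, q, strong=False):
    """Per-context excess cost when going from p to q cores.

    cct_p, cct_q: dict mapping a call path (tuple of frames) to the average
    per-process cost of that context on p and q cores.
    Weak scaling: ideal per-process cost is constant, so no coefficients.
    Strong scaling: ideal total cost is constant, so scale by core counts.
    """
    coeff_p, coeff_q = (p, q) if strong else (1, 1)
    paths = set(cct_p) | set(cct_q)
    return {
        path: coeff_q * cct_q.get(path, 0.0) - coeff_p * cct_p.get(path, 0.0)
        for path in paths
    }
```

Because the result is keyed by calling context, sorting it by value points directly at the contexts responsible for most of the loss, which is the kind of top-down view shown for FLASH on the next slide.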

  14. FLASH: Top-down View of Scaling Losses • Weak scaling on BG/P, 256 to 8192 cores • 21% of scaling loss is due to looping over all MPI ranks to build the adaptive mesh

  15. Improved FLASH Scaling of AMR Setup • [Graph; lower is better] • Graph courtesy of Anshu Dubey, U Chicago

  16. Outline • Motivation • Call Path Profiling • Pinpointing Scaling Bottlenecks • Blame Shifting — parallel idleness and overhead in work stealing — lock contention — load imbalance • Call Path Tracing • Conclusions

  17. Blame Shifting • Problem: sampling often measures symptoms of performance losses rather than causes — worker threads waiting for work — threads waiting for a lock — MPI process waiting for peers in a collective communication • Approach: shift blame from victims to perpetrators • Flavors — active measurement — post-mortem analysis only
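One way to picture "shift blame from victims to perpetrators" for idle worker threads is the sketch below. The scheme is an assumption chosen for illustration, not necessarily the one the authors use: at each sample tick, the idle threads' lost time is divided evenly among the contexts that are doing work at that moment. The function name `shift_idleness_blame` and the tick representation (one entry per thread, `None` meaning idle) are hypothetical.

```python
from collections import defaultdict


def shift_idleness_blame(ticks):
    """Attribute idleness to working contexts instead of idle threads.

    ticks: list of sample ticks; each tick lists, per thread, the context
    that thread is executing, or None if the thread is idle.
    Returns (work, idleness) dicts keyed by context, in units of samples.
    """
    work = defaultdict(float)
    idleness = defaultdict(float)
    for tick in ticks:
        working = [ctx for ctx in tick if ctx is not None]
        idle = len(tick) - len(working)
        if not working:
            continue  # nobody is running, so there is no context to blame
        share = idle / len(working)
        for ctx in working:
            work[ctx] += 1
            idleness[ctx] += share
    return dict(work), dict(idleness)
```

With four threads, a tick where only `serial_setup` runs charges that context three samples of idleness, so a serial phase shows up as the cause of the waiting rather than the scheduler's idle loop.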

  18. Cilk: An Influential Multithreaded Language
        cilk int fib(int n) {
          if (n < 2) return n;
          else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x + y);
          }
        }
      • spawn: asynchronous call; creates a logical task that only blocks at a sync
      • spawn + recursion (fib): quickly creates significant logical parallelism
      • [Figure: spawn tree: f(n) spawns f(n-1) and f(n-2); f(n-1) spawns f(n-2) and f(n-3); f(n-2) spawns f(n-3) and f(n-4); ...]
      • To map logical tasks to compute cores: lazy thread creation + work-stealing scheduler

  19. Call Path Profiling of Cilk: Stack ≠ Path • Problem: work stealing separates source-level calling contexts in space and time • [Figure: the spawn tree of f(n) partitioned among thread 1, thread 2, and thread 3] • Consider thread 3: — physical call path: f(n-1) f(n-3) ... — logical call path: f(n) f(n-1) f(n-3) ... • Logical call path profiling: recover full relationship between physical and source-level execution
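The thread 3 example on slide 19 suggests a simple way to think about logical call paths: the context in which a stolen task was spawned, followed by what the thief's own stack shows. The sketch below is only that mental model; the function name `logical_call_path` and the idea that the spawn context is available as a list are assumptions, and the slides do not describe how the profiler actually recovers it.

```python
def logical_call_path(spawn_context, physical_path):
    """Combine a stolen task's spawn context with the thief's physical stack.

    spawn_context: frames (outermost first) that were active, on the victim
        thread, when the stolen task was created.
    physical_path: frames (outermost first) found by unwinding the thief's
        stack at the sample point.
    """
    return list(spawn_context) + list(physical_path)
```

For thread 3 on the slide, the physical path `f(n-1) f(n-3)` combined with the spawn context `f(n)` gives the logical path `f(n) f(n-1) f(n-3)`. A task that was never stolen has an empty spawn context, so its logical and physical paths coincide.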

  20. What If My Cilk Program Is Slow? • Possible problems: — parallel idleness: parallelism is too coarse-grained — parallel overhead: parallelism is too fine-grained — idleness and overhead: wrong algorithm • Try logical call path profiling: — measuring idleness • samples accumulate in the scheduler! (“You spent time here!”) • no source-level insight! • scheduler loop: while (noWork) { t = random thread; try stealing from t; } — measuring overhead • how?! • work and overhead are both machine instructions • e.g. save-live-vars(); offer-task(); user-work(); retract-task(); • Goal: insightfully attribute idleness and overhead to logical contexts • PPoPP 2009; IEEE Computer 12/09
