
Lecture 13: Things of Interest (G63.2011.002/G22.2945.001, November 30, 2010)



  1. Lecture 13: Things of Interest G63.2011.002/G22.2945.001 · November 30, 2010 Debugging Instrumentation Profiling and Hardware

  2. Outline
     • Debugging
     • Instrumentation
     • Profiling and Hardware

  3. Today: “Odds and Ends”
     • Tools (emphasis on Linux, non-proprietary)
     • Ways to use them
     • Learn new details about hardware along the way
     • . . . across our four ways of high-performance computing: Serial, OpenMP, MPI, GPU
     Will post slides, video (hopefully).
     Questions about your final project? → Ask us! We’re happy to help!



  6. Debugging
     Bad program behavior:
     • Wrong result
     • Segmentation fault
     • Run-time errors
     • assert() violations (<assert.h>, -DNDEBUG)
     Desired insight:
     • Where? (Source code location)
     • When? (Execution history)
       • History within function
       • Call stack
     • With what data? (Variable contents, etc.)
     • → Why? (And how do I fix it?)
     What about bugs that aren’t reproducible?
     Key actions: attach to the inferior, trace (ptrace()) its execution.
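As a minimal sketch of the assert() bullet above: the helper below (a hypothetical `mean` function, not from the lecture) states its preconditions with assert(). When a precondition fails, the program aborts and reports the source file and line, which answers the "Where?" question directly. Compiling with -DNDEBUG removes the checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration: mean of n samples.
 * The assert() calls document the preconditions and abort with file/line
 * information when one is violated. Building with -DNDEBUG compiles
 * them out entirely. */
double mean(const double *samples, size_t n)
{
    assert(samples != NULL);
    assert(n > 0);              /* otherwise the division below is by zero */

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += samples[i];
    return sum / (double)n;
}
```

Calling `mean(values, 0)` in a debug build stops at the second assert; in an -DNDEBUG build the same call silently divides by zero, which is the kind of latent problem the checks are meant to surface.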

  7. Debugging with GDB: Summary
     • Three main usage patterns:
       • Run-until-crash (‘Post-mortem’)
       • Core dump
       • Break-and-trace
     • -g vs. -On (debug information vs. optimization level)
     • Ctrl-X, Ctrl-A (toggles GDB's text user interface)
     • Step into (s), step over (n), finish (fin)
     • p data to look at variables

  8. Other Debuggers: DDD. GNU Data Display Debugger (free)

  9. Other Debuggers: TotalView. TotalView (proprietary)

  10. Other Debuggers: DDT. Allinea Distributed Debugging Tool (proprietary)


  12. Question
      Problem: Debugging only deals with problems when they cause observable wrong behavior (e.g. a crash). It doesn’t find latent problems.
      Suggested solution: Monitor program behavior (precisely) while it’s executing. Possible?

  13. What is Instrumentation? A.k.a. how does Valgrind work?
      x86(-64) binary → IR (SSA) → tool → x86(-64) binary
      Tools:
      • Memcheck (find pointer bugs)
      • Massif (find memory allocations)
      • Cachegrind/Callgrind (find cache misbehavior)
      • Helgrind/DRD (find data races)


  15. Profilers
      Slow program execution:
      • Poor memory access pattern
      • Expensive processing (e.g. division, transcendental functions)
      • Control overhead (branches, function calls)
      Desired insight:
      • Where is time spent? (Source code location)
      • When? (Execution history)
        • Call stack
      • What is the limiting factor?
      Main types of profilers:
      • Exact, Sampling
      • Hardware, Software

  16. Reflections on Profilers
      Sampling: + Fast; - Noisy (takes time to converge!)
      Exact: - Slow; + Exact
      No free lunch. But: No exact machine-level profiler!

  17. Making sense of OProfile sample counts
      What do OProfile sample counts mean? Individually: not much! → Ratios make sense!
      What kind of ratios?
      • (Events in Routine 1)/(Events in Routine 2)
      • (Events in Line 1)/(Events in Line 2)
      • (Count of Event 1 in X)/(Count of Event 2 in X)
      Always ask: Is the sample count sufficiently converged?

  18. OProfile: Examples I
      • (DCU_LINES_IN or L1D_REPL) / INST_RETIRED: L1 miss rate. Target: small, location understood. (seen)
      • L2_LINES_IN / INST_RETIRED: L2 miss rate. Target: small.
      • INST_RETIRED / CPU_CLK_UNHALTED: Instructions per clock. Target: > 1. (seen)
      • CYCLES_L1I_MEM_STALLED / CPU_CLK_UNHALTED: Instruction fetch stalls. Should never happen: it means the CPU could not predict where the code is going. (→ pipeline stall)
      • BR_IND_CALL_EXEC / INST_RETIRED: Fraction of indirect calls (virtual table lookups)
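The ratios above can be computed from raw sample counts roughly as follows. This is a sketch under one assumption worth stating: OProfile records one sample every `count` occurrences of an event, so sample counts for two different events are only comparable after scaling each by its own sampling period. The function name and parameters are illustrative, not part of OProfile.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (not an OProfile API): ratio of two event counts,
 * reconstructed from sample counts. Each sample stands for `period`
 * occurrences of its event, so both sides are scaled before dividing. */
double event_ratio(uint64_t samples_a, uint64_t period_a,
                   uint64_t samples_b, uint64_t period_b)
{
    assert(samples_b > 0 && period_b > 0);   /* avoid dividing by zero */
    double events_a = (double)samples_a * (double)period_a;
    double events_b = (double)samples_b * (double)period_b;
    return events_a / events_b;
}
```

For instance, with INST_RETIRED samples as the numerator and CPU_CLK_UNHALTED samples as the denominator, the result estimates instructions per clock. With very few samples in either count the ratio is mostly noise, which is why the convergence question matters.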

  19. OProfile: Examples II
      • L1D_CACHE_LD / CPU_CLK_UNHALTED: Fraction of time the L1 load/store buffers are full
      • STORE_BLOCK / CPU_CLK_UNHALTED: Fraction of cycles the CPU is blocked waiting to be able to write to memory
      • PAGE_WALKS / CPU_CLK_UNHALTED: Cycles spent waiting for page table walks (TLB miss penalty)
      • DTLB_MISSES / INST_RETIRED: Data TLB miss rate

  20. Virtual Memory
      [Figure: pages of a process's virtual address space (0x00000000 to 0x7fffffff, with text at 0x00010000, data at 0x10000000, and the stack) mapped onto the physical address space (0x00000000 to 0x00ffffff). Legend: page belonging to process, page not belonging to process.]


  22. Virtual Memory
      Linear address (bits 31 to 0), split 10 | 10 | 12: page directory index, page table index, offset into the 4K memory page.
      CR3 (32 bits*) → page directory → 32-bit PD entry → page table → 32-bit PT entry → 4K memory page
      *) 32 bits aligned to a 4-KByte boundary
      (One page directory per process.)
      . . . and two extra memory accesses per memory access?
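The 10/10/12 split of a 32-bit linear address can be written out as a few shifts and masks. The struct and function names below are made up for illustration; only the bit layout comes from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type: the three fields of a 32-bit linear address. */
typedef struct {
    uint32_t dir_index;    /* bits 31..22: index into the page directory */
    uint32_t table_index;  /* bits 21..12: index into the page table */
    uint32_t offset;       /* bits 11..0:  byte offset within the 4K page */
} linear_address_parts;

linear_address_parts split_linear_address(uint32_t addr)
{
    linear_address_parts p;
    p.dir_index   = addr >> 22;
    p.table_index = (addr >> 12) & 0x3ffu;   /* keep 10 bits */
    p.offset      = addr & 0xfffu;           /* keep 12 bits */

    /* Sanity check: the three fields reassemble to the original address. */
    assert(((p.dir_index << 22) | (p.table_index << 12) | p.offset) == addr);
    return p;
}
```

The two indices are what make an untranslated access expensive: one memory read for the page directory entry, one for the page table entry, and only then the access itself.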


  24. Caching the Page Table
      [Figure: translation flow. Virtual address → TLB. TLB hit → physical address. TLB miss → page table. Page table hit → TLB write → physical address. Page not present → disk → page table write.]
      What leads to a TLB flush? TLB flush ⇒ Cache flush?

  25. Influencing TLB performance
      What to do if limited by TLB performance?
      • Access fewer pages:
        • Increase locality
        • Problem: fragmented memory!
      • Default x86 page granularity: 4 KiB. “Huge” pages also exist: 2 MiB
      Obtaining huge-page memory (Linux only):
      • mount -t hugetlbfs none /mnt/huge
      • Create /mnt/huge/myfile
      • mmap() that file.
      → 5–10% gain on matmul
      But: Huge pages are a shared, scarce resource!
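A quick back-of-the-envelope calculation shows why huge pages relieve TLB pressure: every distinct page touched needs its own TLB entry, so the same buffer needs far fewer entries at 2 MiB granularity. The helper below is an illustrative sketch that assumes the buffer starts on a page boundary.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_PAGE_BYTES ((size_t)4 << 10)   /* 4 KiB */
#define HUGE_PAGE_BYTES  ((size_t)2 << 20)   /* 2 MiB */

/* Illustrative helper: number of pages of size page_bytes needed to cover
 * nbytes, assuming the buffer starts on a page boundary (rounds up). */
size_t pages_spanned(size_t nbytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    return (nbytes + page_bytes - 1) / page_bytes;
}
```

A 1024 × 1024 matrix of doubles occupies 8 MiB, so `pages_spanned` gives 2048 pages at 4 KiB and 4 pages at 2 MiB.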

  26. OProfile: Also for multi-processor programs
      • EXT_SNOOP / INST_RETIRED: Fraction of instructions causing retrieval of a modified cache line from another core
      • (L1D_CACHE_LOCK_DURATION + 20 × L1D_CACHE_LOCK) / CPU_CLK_UNHALTED: Fraction of cycles spent waiting for synchronized (“atomic”) access to memory

  33. Atomic Operations
      Collaborative (inter-block) global memory update: Read → Increment → Write. Interruptible between the steps!
      Atomic global memory update: Read → Increment → Write. Protected between the steps.
      How? OpenCL: atomic_{add,inc,cmpxchg,...}(int *global, int value);
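The same idea can be sketched on the CPU side with C11 `<stdatomic.h>` instead of OpenCL (a deliberate substitution, so the example runs without a GPU). The first function performs the read, add, and write as one indivisible step; the second builds the same update from compare-and-exchange, retrying whenever another thread modified the value in between. The function names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One-step atomic update: read, add, and write cannot be separated.
 * Returns the value held before the addition. */
int add_atomic(atomic_int *counter, int value)
{
    assert(counter != NULL);
    return atomic_fetch_add(counter, value);
}

/* The same update built from compare-and-exchange: the write only succeeds
 * if *counter still holds the value we read; otherwise retry. */
int add_with_cmpxchg(atomic_int *counter, int value)
{
    assert(counter != NULL);
    int seen = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &seen, seen + value)) {
        /* On failure, seen has been refreshed with the current value. */
    }
    return seen;   /* value before the addition */
}
```

A plain `*counter += value` on a non-atomic int compiles to separate read, add, and write steps, so two threads can both read the old value and one increment is lost; either function above avoids that.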
