
Debugging Scalable Applications on the XT



  1. Debugging Scalable Applications on the XT May 2nd 2009 Chris Gottbrath Director, Product Management

  2. Debugging Scalable Applications • Intro – Challenges – Products • Scalability – Interactive Subset Debugging • Batch Environments – TVScript • Long Distance Collaborations – Remote Display Client • Memory Limitations – Memory Debugging • A look forward – RedZones – ReplayEngine • Questions TotalView Technologies –Proprietary– Plans Subject to Change without Notice

  3. HPC Debugging Challenges • There are different kinds of challenges – Technical – Educational – Organizational • It seems to me that they revolve around 3 C’s – Concurrency – Complexity – Collaboration

  4. Challenges: Concurrency • Distributed multi-process – Processes may be doing the same thing or different things – Data is distributed across the cluster – Behavior can sometimes be hard to reproduce – A hung process can sometimes be hard to differentiate from a hung node • Hybrid and/or multi-threaded – Behavior can be hard to reproduce – May introduce a second tier of parallelism • Scalability – Runs may include tens or hundreds of thousands of threads of execution • Performance of the user’s program • Performance of the tool • Details can overwhelm the user – How do users want to interact with these large jobs? • Lightweight tools? • Work with a subset of the processes? • Fully featured debugging on the full-scale jobs?

  5. Challenges: Complexity • Software tool chain – Languages and new language constructs – Multiple compilers and platforms • Hardware and runtime – Available node memory – Processor characteristics (with things like the Cell) – What facilities does the runtime provide • Breaking new ground – The “right” answers may be unknown • Validation from previous models

  6. Challenges: Community • Codes are developed by large teams – Train team members on the code and tools and platforms – Share the most effective techniques – Coordinate troubleshooting with the appropriate experts within the team • Teams may be highly distributed – Geographically and organizationally – Debugging may happen from across the hall or across the globe • Management of system resources – Balancing development and production needs – Problems can occur at production scale and with production datasets • Should users be allowed to troubleshoot in production queues and at production scale? – Debugging needs to be able to work with different queue policies

  7. Solutions • Product Overview – TotalView – MemoryScape – ReplayEngine • Large Scale Concurrency – Interactive Subset Debugging • Batch Environments – Batch Debugging with TVScript • Collaboration – Long Distance Remote Debugging • Memory Limitations – Memory Debugging

  8. TotalView debugger: Develop an understanding of program behaviour • C, C++, Fortran 77, Fortran 90, UPC – Complex Language Features • Wide compiler and platform support – Cray XT – Linux x86, x86-64 – Others: Solaris, BG, Cell, Mac, etc. • Parallel debugging – MPI, pthreads, OpenMP, UPC • Memory debugging capabilities – Integrated into the debugger • Remote Display Client • Graphical User Interface – Simple things are easy – Advanced operations are available – Visualization • Scripting – CLI and TVScript

  9. MemoryScape: Simple to use, intuitive memory debugging • What is MemoryScape? – Streamlined – Lightweight – Intuitive – Collaborative – Memory Debugging • Features – Shows • Memory errors • Memory status • Memory leaks • Buffer overflows – MPI memory debugging – Remote memory debugging – Tech • Low overhead • No Instrumentation – Interface • Inductive • Collaboration • Multi-process

  10. ReplayEngine: Radically simplified debugging • Enhances debugging experience • Add-on to TotalView (version 8.6) • Captures execution history – Records all external input to the program – Records internal sources of non-determinism • Replays execution history – Examine any part of the execution history – Step back through code as easily as you step forwards – Jump to points of interest • Everything is managed by the tool – The user just says where they want to go • Supported on Linux x86 and x86-64 • Supports MPI, Pthreads, and OpenMP

  11. Large Scale Concurrency • Dealing with Tera and Peta Scale – Challenging for interactive tools – Multiple approaches • Interactive Subset Debugging • Ongoing Tool Scalability Improvements • Scalable Display of Data

  12. Attaching the Debugger to Part of a Job • Debug a subset of the processes that make up the job – Sometimes the user does not need to control and see every process to understand the behavior or identify the defect • The subset can be changed at any time – Can narrow, expand or shift focus • Uncouples interactive performance from job size – After the subset operation completes – Interactive performance depends on subset size • Supports the use of lightweight tools – LLNL’s STAT • Recent work – 1k of 16k acts like 1k of 1k – BG subset support – Enhanced support for tools integration
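One way to choose which processes to attach to is to group ranks by what they are currently doing and keep one representative per group, in the spirit of a lightweight stack-trace tool. The sketch below is only an illustration of that idea: the stack traces, function names (`solve`, `exchange_halo`, `write_checkpoint`) and helper names are hypothetical, and it is not how TotalView or STAT implement subset selection.

```python
from collections import defaultdict


def group_by_stack(traces):
    """Map each distinct call stack to the sorted list of ranks showing it."""
    groups = defaultdict(list)
    for rank, stack in traces.items():
        groups[tuple(stack)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def pick_subset(groups, per_group=1):
    """Keep the first `per_group` ranks of every group as the debug subset."""
    subset = []
    for ranks in groups.values():
        subset.extend(ranks[:per_group])
    return sorted(subset)


# Hypothetical snapshot of a 16-process job: most ranks sit in a collective,
# while two ranks are somewhere else and are the interesting ones.
traces = {rank: ["main", "solve", "MPI_Allreduce"] for rank in range(16)}
traces[5] = ["main", "solve", "exchange_halo", "MPI_Recv"]
traces[11] = ["main", "write_checkpoint"]

groups = group_by_stack(traces)
subset = pick_subset(groups)
print(subset)
```

The debugger would then be attached only to the ranks in `subset`, so interactive cost follows the number of distinct behaviours rather than the job size.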

  13. Unprecedented Scalability for Interactive Tool • Techniques for using TotalView at scale – Subset attach, message queue display, cycle detection, call graph, view data across processes and threads, etc. • Current scalability (tested and verified) – Users debug 1 to 4,000 processes regularly – Many operations at 1k take less than a few seconds – Higher scale, depending on the system and application • Blue Gene: up to 16k processes • Linux cluster: up to 6k processes • Cray XT: up to 4k processes • Actively working on performance and scalability – Improvements come from rigorous profiling and timing – Requires close collaboration with both customers and other vendors • Partnership program
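Cycle detection over pending messages can be pictured as a search for a loop in a wait-for graph: if rank 0 is blocked on rank 1, rank 1 on rank 2, and rank 2 on rank 0, none of them can make progress. The following is a minimal sketch under the assumption that each blocked rank waits on exactly one other rank; the `waits_for` mapping and the function name are hypothetical, and this is not TotalView's implementation.

```python
def find_wait_cycle(waits_for):
    """Return the ranks forming a wait-for cycle, or None if there is none.

    `waits_for` maps a blocked rank to the single rank it is waiting on.
    Ranks that are not blocked do not appear as keys.
    """
    for start in waits_for:
        path = []       # ranks visited on this walk, in order
        seen_at = {}    # rank -> index in path
        rank = start
        while rank in waits_for and rank not in seen_at:
            seen_at[rank] = len(path)
            path.append(rank)
            rank = waits_for[rank]
        if rank in seen_at:
            # We came back to a rank already on the path: that tail is the cycle.
            return path[seen_at[rank]:]
    return None


# Hypothetical blocked receives: 0 -> 1 -> 2 -> 0 deadlock, and 3 waits on 0.
deadlocked = find_wait_cycle({0: 1, 1: 2, 2: 0, 3: 0})
# A simple chain with no loop: rank 2 is not blocked, so progress is possible.
healthy = find_wait_cycle({0: 1, 1: 2})
print(deadlocked, healthy)
```

Restarting the walk from every rank keeps the sketch short at the price of quadratic worst-case time; a tool working at thousands of processes would want a single-pass traversal.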

  14. Scalable Display of Data

  15. Debugging in Batch Environment • Batch Environments Support – Many users – Non-interactive usage model • Upload data and code • Compile • Submit • Wait • Run • Download results – Interactive queues • Some sites • Smaller scale • How to do debugging in this model? – printf() – Manual TotalView CLI scripting – TVScript

  16. Batch Debugging with TVScript • New in TotalView 8.6 • User-extensible script to drive a target program to completion under the TotalView debugger • Handles all the event management overhead so the user doesn’t have to • Allows you to – Gather debugging data in the “regular queue” without interactivity while the program runs – Do very structured and reproducible kinds of problem analysis – “Narrow down” problems so that you can do focused interactive debugging as a second stage • How does it work? – You define breakpoints – You associate operations with those breakpoints such as • Print a specific variable • Print all local variables • Stack trace • Count • Set other breakpoints, watchpoints • Set data within the program – You submit the script into the batch queue and it runs without any user interaction – Output is gathered into a single debugging output file
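The breakpoint-plus-actions model described above can be illustrated without any TotalView syntax. The sketch below uses Python's own `sys.settrace` hook to run a program to completion, fire a list of actions whenever a named function is entered, count the hits, and collect everything in one log. All names here (`run_with_actionpoints`, `print_locals`, `print_var`, the toy `simulation`) are hypothetical, and this is an analogy for the workflow, not TVScript itself.

```python
import io
import sys


def print_locals(frame, log):
    """Action: print all local variables visible at the breakpoint."""
    for key, value in sorted(frame.f_locals.items()):
        log.write(f"  {key} = {value!r}\n")


def print_var(name):
    """Action factory: print one specific variable."""
    def action(frame, log):
        log.write(f"  {name} = {frame.f_locals.get(name)!r}\n")
    return action


def run_with_actionpoints(target, actionpoints, log):
    """Run target() to completion with no interaction.

    `actionpoints` maps a function name (the "breakpoint") to a list of
    actions; each action is called with the current frame and the log.
    Returns a dict of hit counts per breakpoint.
    """
    hits = {}

    def tracer(frame, event, arg):
        name = frame.f_code.co_name
        if event == "call" and name in actionpoints:
            hits[name] = hits.get(name, 0) + 1
            log.write(f"[hit {hits[name]}] {name}\n")
            for action in actionpoints[name]:
                action(frame, log)
        return None  # entry breakpoints only, so no per-line tracing

    sys.settrace(tracer)
    try:
        target()
    finally:
        sys.settrace(None)
    return hits


# Toy target program standing in for the batch job.
def accumulate(total, step):
    return total + step


def simulation():
    total = 0
    for step in range(3):
        total = accumulate(total, step)
    return total


log = io.StringIO()  # stands in for the single debugging output file
hits = run_with_actionpoints(simulation, {"accumulate": [print_locals]}, log)
print(log.getvalue())
```

Because the run needs no terminal, the same pattern fits a batch queue: the job script launches the program under the driver, and the log is the only artifact that has to be downloaded afterwards.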

  17. Collaboration • Diverse collaborations – Scientific or technical domain experts – Computer scientists – System consultants – Grad students of various flavors • Geographically Distributed • Enabling access – Long Distance Remote Debugging • Sharing Data – Reports and Exports
