Cray Tools, an overview
8th International Parallel Tools Workshop
Stuttgart, Germany, 1st October 2014
Stefan Andersson
Cray onsite support at HLRS
C O M P U T E | S T O R E | A N A L Y Z E
Introduction
● Cray develops several tools for its XE/XK and XC computers
● A lot of effort is going into their development
● Several of the tools are ‘stand-alone’ solutions, developed for a specific problem
  ● STAT, ATP
  ● IOBUF (includes serial IO monitoring)
  ● MPIIO profiling
● Other tools interact in order to be more efficient or to create a new solution for a problem
  ● CCE providing ‘hooks’ for profiling at loop level
  ● Reveal using CCE listing information and CrayPat profiling
Which tools does Cray develop
● It doesn’t make sense to develop tools where a good tool already exists on the market
  ● DDT and Totalview are good examples
● Cray’s tools either
  ● are something new, like Reveal
  ● concentrate on a solution to a specific issue, like STAT
  ● are part of the development process, like MPIIO Stats
  ● come out of benchmarking, like IOBUF
● Cray also collaborates with different sites in developing tools
CCE : Cray Compiler Environment
● The compiler is in general not considered a ‘tool’, but in fact it is the most important piece of user software
  ● Compiles and links the user application
  ● Gives feedback about the application
    ● Code errors
    ● How optimization was done or not done (lst file)
  ● Provides ‘hooks’ into different levels of the application, to which other tools can attach
    ● Functions
    ● Loops
● This makes CCE the ‘centerpiece’ of Cray’s tools strategy
● CCE can adapt rather quickly to user/tool needs
● All tools will work with other compilers, but there might be some limitations
The goal is not to force a user to use CCE, but to provide extensions where it makes sense
Overview : Tools infrastructure (selection)
Light weight: At most relinking. Get a first picture of performance or problems during execution.
In-depth: Recompile/Relink. Provides detailed information at user routine level.
Debugging (get your code up and running correctly)
● Light weight: ATP, STAT
● In-depth: lgdb with ccdb, Fast track, DDT, Totalview, Intel Inspector
Profiling (locate performance bottlenecks)
● Light weight: CrayPAT-lite, Profiler library, IOBUF, MPIIO Stats
● In-depth: CrayPAT, Apprentice2, Reveal, Intel Vtune
Ease of Use : CrayPAT evolving over time
● CrayPat is not easy to get started with:
  ● Man pages for intro_craypat, pat_build and pat_report have ~4000 lines
  ● ~70 environment variables
  ● A lot of arguments available
  ● Output is configurable to the very last character
● Improvement in ‘ease of use’ over time:
  1. Interactive help tool: pat_help
  2. Introduction of the ‘Automatic Profiling Analysis’ approach, which guides the user to a traced run in two steps
    ● First a sampling run is done
    ● Based on this run, a traced application is generated and run
    ● User can interact with the process and make changes
  3. Introduction of CrayPAT-lite
    ● Profiling is transparent to the user: no changes in the build and execution process
● Users can still use ‘plain’ CrayPAT
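The two-step idea behind ‘Automatic Profiling Analysis’ can be sketched in a few lines of Python. This is only an illustration of "sample first, then trace what matters": the function name pick_functions_to_trace, the 5% threshold and the sample data are all invented here and are not CrayPat's actual selection rule or interface.

```python
from collections import Counter


def pick_functions_to_trace(samples, min_share=0.05):
    """Return the functions that received at least `min_share` of all samples.

    `samples` is a list with one function name per sampling hit (hypothetical
    input format). The result is ordered from hottest to coldest.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        return []
    return [name for name, hits in counts.most_common()
            if hits / total >= min_share]


# Step 1: a (made-up) sampling run, 100 hits in total.
samples = (["solve"] * 60 + ["exchange_halo"] * 30
           + ["init"] * 7 + ["write_log"] * 3)

# Step 2: only the functions above the threshold would be traced.
print(pick_functions_to_trace(samples))
```

Tracing only the functions that dominate the sampling run keeps the overhead of the second, traced run low, which is the point of doing the cheap sampling run first.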
Debugging Tools on the Cray XC30
The porting/optimization cycle
Port or update your application to the XC30
Debug your application (get right results).
● Stack Trace Analysis Tool (STAT)
● Abnormal Termination Processing (ATP)
● Fast Track Debugger (FTD)
● Allinea DDT
● lgdb (ccdb)
Profile your application for performance.
● Cray performance analysis toolkit CrayPat.
● CrayPat lite for easier profiling.
● Cray Profiler Library
Stack Trace Analysis Tool (STAT)
For when nothing appears to be happening…
Stack Trace Analysis Tool (STAT)
● Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison.
● Gathers and merges stack traces from a running application’s parallel processes.
● Creates call graph prefix tree
  ● Compressed representation
  ● Scalable visualization
  ● Scalable analysis
● It is very useful when an application seems to be stuck/hung
● Full information including use cases is available at http://www.paradyn.org/STAT/STAT.html
● Scales to many thousands of concurrent processes.
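As a rough illustration of what "merging stack traces into a call graph prefix tree" means, here is a minimal Python sketch. It is not STAT's implementation: the Node class, merge_traces, render and the four example traces are invented for this example. Ranks whose stacks share the same leading frames share the same tree nodes, which is what keeps the representation compact.

```python
class Node:
    """One frame in the merged call-graph prefix tree."""

    def __init__(self, frame):
        self.frame = frame
        self.ranks = set()    # ranks whose stack passes through this frame
        self.children = {}    # frame name -> Node


def merge_traces(traces):
    """Merge per-rank stack traces (outermost frame first) into one tree."""
    root = Node("<root>")
    for rank, frames in traces.items():
        root.ranks.add(rank)
        node = root
        for frame in frames:
            if frame not in node.children:
                node.children[frame] = Node(frame)
            node = node.children[frame]
            node.ranks.add(rank)
    return root


def render(node, depth=0):
    """Return the tree as indented text lines, one line per frame."""
    lines = []
    for name in sorted(node.children):
        child = node.children[name]
        lines.append(f"{'  ' * depth}{name} "
                     f"[{len(child.ranks)} ranks: {sorted(child.ranks)}]")
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical snapshot of a hung 4-rank job: rank 2 never reaches the barrier.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}
tree = merge_traces(traces)
print("\n".join(render(tree)))
```

In the printed tree the odd one out (rank 2, still in compute while the others wait in MPI_Barrier) shows up as its own branch, which is the kind of information one looks for when a job appears to hang.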
Stack Trace Merge Example
Merged Stack
STAT Advantages
● Always available as linked into an application
● Doesn’t use CPU cycles if not needed/activated
● Attaches to a running program at scale
● Can create several snapshots during a run
● No extra license costs
Abnormal Termination Processing (ATP)
For when things break unexpectedly…
(Collecting back-trace information)
ATP Description
● Abnormal Termination Processing is a lightweight monitoring framework that detects crashes and provides more analysis instead of silently terminating.
  ● Designed to be so light weight it can be used all the time with almost no impact on performance.
  ● Almost completely transparent to the user
    ● Requires atp module loaded during compilation (usually included by default)
    ● Output controlled by the ATP_ENABLED environment variable (set by user).
  ● Tested at scale (tens of thousands of processors)
● ATP rationalizes parallel debug information into three easier-to-use forms:
  1. A single stack trace of the first failing process to stderr
  2. A visualization of every process’s stack trace when it crashed
  3. A selection of representative core files for analysis
ATP Usage
● Job scripts must include the following variable
  ● export ATP_ENABLED=1
● ATP respects ulimits on corefiles.
  ● ulimit -c unlimited
● After abnormal termination the application will not simply crash but proceed with the ATP analysis instead.
● Backtrace of first crashing process is passed to stderr and the merged backtrace of all procs is in atpMergedBT.dot
[Screenshot annotations: trace back of crashing process; core files are being generated.]
Viewing the results after the crash
● The merged backtrace is inspected via STAT:
  > module load stat
  > stat-view atpMergedBT.dot
● The core files can be inspected with a debugger like gdb or Allinea DDT.
Fast Track Debugging
For getting to the problem more quickly…
The Problem
● Debug compilations eliminate optimizations
  ● Today's machines really need optimizations
  ● Slows down execution
  ● Problem might disappear
● Compile such that both debug and non-debug (optimized) versions of each routine are created.
  ● Use -Gfast instead of -g with the Cray compiler. Check the man pages.
● Linkage such that optimized versions are used by default
● Debugger overrides default linkage when setting breakpoints and stepping into functions
● Supported by DDT and lgdb.
A Closer Look at How FTD Works
[Diagram: the source code (subroutine difuze(…), subroutine interf(…), with calls to difuze(…) and interf(…)) exists in two compiled versions: optimized binary code (difuze(), interf()) and debug code (dbg$difuze(), dbg$interf()).]
● Breakpoint requested in interf(), placed in interf_debug()
● Jmp inserted as part of breakpoint planting
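The redirection described on this slide can be mimicked in Python as a loose analogy. The real mechanism patches a jump into the optimized binary; the sketch below only models the effect with a lookup table. FastTrackTable, interf_optimized and interf_debug are names invented for this illustration and are not part of any Cray tool.

```python
calls = []  # records which version of a routine actually ran


def interf_optimized(x):
    """Stand-in for the optimized build of interf()."""
    calls.append("interf (optimized)")
    return x * 2


def interf_debug(x):
    """Stand-in for the debug build of interf(), same result, easy to step."""
    calls.append("dbg$interf (debug)")
    doubled = x + x
    return doubled


class FastTrackTable:
    """Runs the optimized version unless a breakpoint is set on the routine."""

    def __init__(self):
        self._optimized = {}
        self._debug = {}
        self._breakpoints = set()

    def register(self, name, optimized, debug):
        self._optimized[name] = optimized
        self._debug[name] = debug

    def set_breakpoint(self, name):
        # Analogue of planting the jump into the debug version.
        self._breakpoints.add(name)

    def clear_breakpoint(self, name):
        self._breakpoints.discard(name)

    def call(self, name, *args):
        table = self._debug if name in self._breakpoints else self._optimized
        return table[name](*args)


table = FastTrackTable()
table.register("interf", interf_optimized, interf_debug)

first = table.call("interf", 3)     # no breakpoint: optimized version runs
table.set_breakpoint("interf")
second = table.call("interf", 3)    # breakpoint set: debug version runs
print(first, second, calls)
```

Both versions return the same value; only the routines the user actually stops in pay the cost of unoptimized code, while the rest of the application keeps running at optimized speed.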
Profiling : CrayPAT
CrayPAT’s Design Goals
● Assist the user with application performance analysis and optimization
  ● Help user identify important and meaningful information from potentially massive data sets
  ● Help user identify problem areas instead of just reporting data
  ● Bring optimization knowledge to a wider set of users
● Focus on ease of use and intuitive user interfaces
  ● Lightweight and automatic program instrumentation
  ● Automatic Profiling Analysis mode to bootstrap the process
● Target scalability issues in all areas of tool development
  ● Work on user codes at realistic core counts with thousands of processes/threads
  ● Integrate into large codes with millions of lines of code
● Be a universal tool
  ● Basic functionality available to all compilers on the system
  ● Additional functionality available from the Cray compiler