
Debugging Large Scale and Hybrid Parallel Code

Mark O'Connor, Lead Developer, mark@allinea.com, www.allinea.com


  1. Debugging Large Scale and Hybrid Parallel Code
     Mark O'Connor, Lead Developer
     mark@allinea.com
     www.allinea.com

  2. Didn't ask for parallelism

  3. ... but got parallelism

  4. Interesting times ...
     • A recent history of parallel computing
       – Increasing core counts
       – Increasing cores per node
       – Established programming models
         • Near-100% of HPC software using MPI and/or OpenMP
         • A close match of software to hardware - good portability
       – The challenge of application scalability remains!
     • Times change ...
       – GPUs entering HPC – power, performance, …
       – Massive multi-core clusters – with many GPUs
       – The challenge of hybrid software is here!

  5. Exploiting technology
     • “Software has become the #1 roadblock … Many applications will need a major redesign” - IDC HPC Update, June 2010
       – Most ISV codes do not scale
       – High programming costs are delaying GPU usage
     • Development tools are a vital part of the solution

  6. Allinea Software
     • UK based HPC tools company since 2001
       – Allinea DDT – the scalable parallel debugger
       – Allinea OPT – the optimization tool for MPI and non-MPI
       – Allinea DDTLite – the parallel debugging plugin for Microsoft Visual Studio
     • Large European and US customer base
       – Ease of use – means tools get used
       – Users debugging regularly at all scales – at 1 or 100,000 cores
       – World's only Petascale debugger!

  7. Some Clients and Partners
     • Academic
       – Over 200 universities
     • Major research centres
       – ANL, EPCC, IDRIS, Juelich, NERSC, ORNL
     • Aviation and Defence
       – Airbus, AWE, Dassault, DLR, EADS, ...
     • Energy
       – CEA, CGG Veritas, IFP, Total, ...
     • EDA
       – Cadence, Intel, Synopsys, ...
     • Climate and Weather
       – UK Met Office, Meteo France, NOAA ...

  8. Background
     • Debugging: Good (aka A Necessary Evil)
       – Reproducing and fixing software problems
         • Complexity of scaling and GPU architecture will introduce bugs
       – Debuggers interactively examine processes and data
         • Fastest way to debug – with less chance of introducing more bugs
       – Bugs at scale need a debugger at scale
         • … until recently debuggers limited to ~4,000-8,000 cores
       – Bugs on GPUs need a debugger for GPUs
         • … until recently GPU software couldn't be debugged
       – Allinea DDT is the first graphical debugger to do both

  9. DDT in a nutshell
     • Scalar features
       – Advanced C++ and STL
       – Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types
       – Memory debugging
     • Multithreading & OpenMP features
       – Step, breakpoint etc. one or all threads
     • MPI features
       – Easy to manage groups
       – Control processes by groups
       – Compare data
       – Visualize message queues

  10. Memory Debugging
     • Find memory leaks
     • Or stop on read/write beyond end of array

  11. Scale Matters
     • Not just a rich man's problem
       – Used to be exclusive to big labs
       – Everyone is joining the fun
       – If you can't debug problems at scale, you can't fix them
     • Historic debugger limits
       – Linear or worse performance in #cores
       – Maximum size limited by patience (or desperation)
     [Figure: Systems in Top 500, series "8k - 32k cores" and "32k+ cores", by Year (June & November Lists)]

  12. DDT: Petascale Debugging
     [Figure: DDT 3.0 Performance Figures, Jaguar XT5. Time (Seconds) against MPI Processes, series "All Step" and "All Breakpoint"]
     • DDT delivers petascale debugging today
       – Collaborations with ORNL on Jaguar Cray XT and CEA
       – Tree architecture – logarithmic performance
       – Now faster at 220,000 than previously at 1,000 cores
       – ~1/10th of a second to step and gather all stacks at 220,000 cores
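
The "tree architecture – logarithmic performance" point can be illustrated with a small calculation. This is only a sketch, not DDT's implementation: the fan-out of 32 is an assumed value chosen for illustration, and `tree_depth` is a hypothetical helper. It counts how many levels a tree with that fan-out needs in order to reach a given number of processes, which is the number of hops a command or a gathered result has to travel.

```python
def tree_depth(num_processes: int, fanout: int) -> int:
    """Levels a fanout-ary tree needs so that its leaves cover num_processes."""
    if num_processes < 1 or fanout < 2:
        raise ValueError("need num_processes >= 1 and fanout >= 2")
    depth = 0
    reach = 1  # number of processes reachable at the current depth
    while reach < num_processes:
        reach *= fanout
        depth += 1
    return depth


FANOUT = 32  # assumed value, for illustration only

for n in (1_000, 8_000, 220_000):
    # A flat design talks to every process from one place: n connections.
    # A tree only needs tree_depth(n, FANOUT) hops from root to leaf.
    print(f"{n:>7} processes: flat = {n} connections, "
          f"tree depth = {tree_depth(n, FANOUT)}")
```

Under this assumption, going from 1,000 to 220,000 processes adds only two levels to the tree, which is why the cost of a step can stay almost flat as the job grows.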

  13. Scalable Process Control
     • Parallel Stack View
       – Finds rogue processes faster
       – Identifies classes of process behaviour
       – Allows rapid grouping of processes
     • Control Processes by Groups
       – Set breakpoints, step, play, stop etc. using user-defined groups
       – Mutates to scalable groups view
       – Compact group representations
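
The idea behind a parallel stack view can be sketched in a few lines: processes that report the same call stack are merged into one entry, so a rogue process stands out as a group of one. The code below is an illustration only, not how DDT is implemented; `group_by_stack`, the frame names and the rank data are all made up for the example.

```python
from collections import defaultdict


def group_by_stack(stacks):
    """Map each distinct call stack (tuple of frames) to the sorted ranks showing it."""
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


# Hypothetical data: eight ranks wait in a collective, one is stuck elsewhere.
stacks = {rank: ["main", "solve", "MPI_Allreduce"] for rank in range(8)}
stacks[5] = ["main", "solve", "exchange_halo", "MPI_Recv"]

groups = group_by_stack(stacks)

# Smallest groups first: the odd one out is listed at the top.
for stack, ranks in sorted(groups.items(), key=lambda item: len(item[1])):
    print(f"{len(ranks):>3} process(es) {ranks}: {' > '.join(stack)}")
```

The same grouping gives "classes of process behaviour" for free: each distinct stack is a class, and its list of ranks is a ready-made process group.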

  14. Presenting Data, Usefully
     • Gather from every node
       – Potentially costly – if all data different
       – Easy if data mostly same
       – New ideas
         • Aggregated statistics
         • Probabilistic algorithms optimize performance – even in pathological case
     • Watch this space!
       – With a fast and scalable architecture, new things become possible
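
Why gathering is "easy if data mostly same" can be shown with a sketch: when most processes hold the same value, the result collapses to a handful of (value, rank range) pairs instead of one entry per process. This is an assumed illustration rather than DDT's algorithm; `compact_ranges`, `summarize` and the sample values are hypothetical.

```python
from collections import defaultdict


def compact_ranges(ranks):
    """Render ranks as ranges, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranks = sorted(ranks)
    if not ranks:
        return ""
    parts = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank == prev + 1:  # still inside a consecutive run
            prev = rank
            continue
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = rank
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def summarize(values_by_rank):
    """Group ranks by the value they hold and compact each group of ranks."""
    by_value = defaultdict(list)
    for rank, value in values_by_rank.items():
        by_value[value].append(rank)
    return {value: compact_ranges(ranks) for value, ranks in by_value.items()}


# Hypothetical variable: 1,000 ranks, almost all holding the same value.
values = {rank: 100 for rank in range(1000)}
values[17] = values[18] = 99
values[512] = 0

for value, ranks in summarize(values).items():
    print(f"value {value}: ranks {ranks}")
```

If every process held a different value the summary would be as large as the raw data, which is the "potentially costly" case named on the slide.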

  15. The new hotness
     • Hybrids are today's hottest topic
       – Technology is moving quickly – compilers, SDKs, hardware
       – NVIDIA CUDA leads in tool support
     • Many lines of code need rewriting for GPUs
       – Memory hierarchy
       – Explicit data transfer between host and accelerator
       – Unusual execution model
         • Kernels, thread blocks, warps, synchronization points
       – Massively fine-grained parallel model
       – It works: 1 billion keys / second on a single GTX480
     • Inevitable that we need to debug!

  16. Debugging Options
     • Old world “printf”
       – NVIDIA SDK now allows this (new) – but has limitations
     • Fake it – run the kernel on the host x86_64 processor
       – Languages often support targeting host CPU instead of GPU
       – Different numeric precision – different answer?
       – Different scheduling – different answer?
       – A reasonable option for some bugs
     • Or run on the GPU with Allinea DDT...
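
One reason a host run can give a "different answer" is that floating-point addition is not associative, so the order in which partial results are combined matters. The sketch below shows this on the CPU in plain Python; it does not run anything on a GPU, and the three values are chosen only to make the effect visible.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, combined in two different orders, as two different
# schedules of a parallel reduction might do.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print("bitwise equal:", left_to_right == right_to_left)
print("equal within tolerance:", math.isclose(left_to_right, right_to_left))
print(repr(left_to_right), repr(right_to_left))
```

The two sums differ in the last bit, so a comparison between host and GPU results usually needs a tolerance rather than exact equality.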

  17. Introducing DDT for CUDA
     • The first graphical debugger for NVIDIA CUDA
       – Simple and easy to use
       – As easy as debugging ordinary code
     • All the commands you'd expect
       – Breakpoints
       – Stepping warps
       – Viewing data and thread stacks
     • Plus more advanced features
       – CUDA memcheck – memory debugging for CUDA
     • More to come!

  18. CUDA Threads in DDT
     • Run the code
       – Browse source
       – Set breakpoints
       – Stop at a line of CUDA code
       – Stops once for each scheduled collection of blocks
     • Select a CUDA thread
       – Examine variables and shared memory
       – Step a warp

  19. Easy to understand scale
     • View all threads in parallel stack view
       – At one glance, see all GPU and CPU threads together
       – Links with thread selection
       – Pick a tree node to select one of the CUDA threads at that location
     • Full MPI support
       – See GPU and CPU threads from multiple nodes

  20. Some Common Problems
     • Incorrect logic (if-statements, calculations)
       – Loop iteration to GPU thread analogy - threads identified by grid and block indexes
       – Solution: Select a thread and step with DDT; look at the local state and shared data
         • Cherry-pick important threads: start, end, a few interior points
     • Kernel bounds – getting the right grids and blocks
       – Incorrect kernel thread boundaries can lead to incomplete results or crashing of the kernel
       – Solution: Bugs will often trigger “CUDA memcheck” errors - run with DDT and CUDA memory debugging enabled
       – Solution: Use DDT's advanced multi-dimensional array viewer to look at data and find the missing indexes
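
The kernel bounds problem can be reproduced on the host with a small emulation of a one-dimensional launch. This is a sketch under stated assumptions: `emulate_launch` is a hypothetical helper that loops over block and thread indexes the way a 1D CUDA grid would enumerate them, and the array size and block size are arbitrary. It shows both failure modes from the slide: too few blocks leave elements unprocessed, and enough blocks without a bounds check produce indexes past the end of the array.

```python
def emulate_launch(n, threads_per_block, num_blocks, guard=True):
    """Emulate a 1D kernel launch over an array of n elements.

    Returns the set of indexes processed and the list of indexes that fall
    outside the array when the kernel has no bounds check.
    """
    written = set()
    out_of_range = []
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            # Global index: the loop-iteration-to-thread analogy.
            i = block_idx * threads_per_block + thread_idx
            if i < n:
                written.add(i)
            elif not guard:
                # An unguarded kernel would touch memory past the array here.
                out_of_range.append(i)
    return written, out_of_range


n, threads_per_block = 1000, 256
blocks_rounded_down = n // threads_per_block                          # too few
blocks_rounded_up = (n + threads_per_block - 1) // threads_per_block  # enough

written, _ = emulate_launch(n, threads_per_block, blocks_rounded_down)
missing = sorted(set(range(n)) - written)
print(f"{blocks_rounded_down} blocks: {len(missing)} elements never "
      f"processed, starting at index {missing[0]}")

written, stray = emulate_launch(n, threads_per_block, blocks_rounded_up, guard=False)
print(f"{blocks_rounded_up} blocks, no guard: {len(stray)} threads index past the array")

written, stray = emulate_launch(n, threads_per_block, blocks_rounded_up)
print(f"{blocks_rounded_up} blocks, guarded: all covered = {written == set(range(n))}")
```

The first case corresponds to the "incomplete results" symptom, where the missing indexes show up as untouched regions in an array view; the second is the kind of out-of-range access a memory checker is designed to catch.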

  21. Current Limitations
     • SDK 3.0 was a big leap forward
     • SDK and driver limitations
       – Only one GPU can be debugged per O/S (per physical node)
       – Cannot currently read launch failure codes (without breaking your code)
       – Only one warp can be stepped per GPU at any time
       – Cannot debug GPU part of (attach to) an already running job
     • Strong partnership with NVIDIA, CAPS and others is helping to extend capabilities
       – SDK 3.1 is much better for general computation
       – SDK 3.2 adds debug support for multiple GPUs per node

  22. Petascale Debugging: Solved. GPU Debugging: Works. Any Questions?
