Debugging Large Scale and Hybrid Parallel Code

Debugging Large Scale and Hybrid Parallel Code
Mark O'Connor, Lead Developer
mark@allinea.com
www.allinea.com

Didn't ask for parallelism

... but got parallelism

Interesting times ...
• A recent history of parallel computing
  – Increasing core counts
  – Increasing cores per node
  – Established programming models
    • Near-100% of HPC software using MPI and/or OpenMP
    • A close match of software to hardware: good portability
  – The challenge of application scalability remains!
• Times change ...
  – GPUs entering HPC: power, performance, …
  – Massive multi-core clusters, with many GPUs
  – The challenge of hybrid software is here!

Exploiting technology
• “Software has become the #1 roadblock … Many applications will need a major redesign” - IDC HPC Update, June 2010
  – Most ISV codes do not scale
  – High programming costs are delaying GPU usage
• Development tools are a vital part of the solution

Allinea Software
• UK based HPC tools company since 2001
  – Allinea DDT: the scalable parallel debugger
  – Allinea OPT: the optimization tool for MPI and non-MPI
  – Allinea DDTLite: the parallel debugging plugin for Microsoft Visual Studio
• Large European and US customer base
  – Ease of use means tools get used
  – Users debugging regularly at all scales, at 1 or 100,000 cores
  – World's only Petascale debugger!

Some Clients and Partners
• Academic
  – Over 200 universities
• Major research centres
  – ANL, EPCC, IDRIS, Juelich, NERSC, ORNL, ...
• Aviation and Defence
  – Airbus, AWE, Dassault, DLR, EADS, ...
• Energy
  – CEA, CGG Veritas, IFP, Total, ...
• EDA
  – Cadence, Intel, Synopsys, ...
• Climate and Weather
  – UK Met Office, Meteo France, NOAA ...

Background
• Debugging: Good (aka A Necessary Evil)
  – Reproducing and fixing software problems
    • Complexity of scaling and GPU architecture will introduce bugs
  – Debuggers interactively examine processes and data
    • Fastest way to debug, with less chance of introducing more bugs
  – Bugs at scale need a debugger at scale
    • … until recently debuggers limited to ~4,000-8,000 cores
  – Bugs on GPUs need a debugger for GPUs
    • … until recently GPU software couldn't be debugged
  – Allinea DDT is the first graphical debugger to do both

DDT in a nutshell
• Scalar features
  – Advanced C++ and STL
  – Fortran 90, 95 and 2003: modules, allocatable data, pointers, derived types
  – Memory debugging
• Multithreading & OpenMP features
  – Step, breakpoint etc. one or all threads
• MPI features
  – Easy to manage groups
  – Control processes by groups
  – Compare data
  – Visualize message queues

Memory Debugging
• Find memory leaks
• Or stop on read/write beyond end of array

Scale Matters
• Not just a rich man's problem
  – Used to be exclusive to big labs
  – Everyone is joining the fun
  – If you can't debug problems at scale, you can't fix them
• Historic debugger limits
  – Linear or worse performance in #cores
  – Maximum size limited by patience (or desperation)
[Figure: Systems in Top 500 with 8k - 32k cores and with 32k+ cores, by year (June & November lists), 2006 to 2010]

DDT: Petascale Debugging
[Figure: DDT 3.0 Performance Figures, Jaguar XT5. Time (seconds, 0 to 0.12) against MPI processes (0 to 200,000); series: All Step, All Breakpoint]
• DDT delivers petascale debugging today
  – Collaborations with ORNL on Jaguar Cray XT and CEA
  – Tree architecture: logarithmic performance
  – Now faster at 220,000 than previously at 1,000 cores
  – ~1/10th of a second to step and gather all stacks at 220,000 cores
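
The "tree architecture: logarithmic performance" point can be illustrated with a small calculation. This is only a sketch of the general idea, not DDT's implementation: the fan-out of 32 and the function names are assumptions made for the example. It compares how many hops a request needs to reach every process through a tree with how many processes a single front end would have to contact in turn.

```python
def tree_depth(num_processes: int, fanout: int) -> int:
    """Number of tree levels needed to reach every process from one root."""
    if num_processes < 1 or fanout < 2:
        raise ValueError("need at least one process and a fan-out of 2 or more")
    levels, reachable = 0, 1
    while reachable < num_processes:
        reachable *= fanout   # each extra level multiplies the reach
        levels += 1
    return levels


def linear_steps(num_processes: int) -> int:
    """Steps when a single front end contacts each process in turn."""
    return num_processes


FANOUT = 32  # hypothetical value for illustration, not taken from DDT
for n in (1_000, 220_000):
    print(n, linear_steps(n), tree_depth(n, FANOUT))
```

Under these assumptions, going from 1,000 to 220,000 processes multiplies the linear cost by 220 but only adds two levels to the tree.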

Scalable Process Control
• Parallel Stack View
  – Finds rogue processes faster
  – Identifies classes of process behaviour
  – Allows rapid grouping of processes
• Control Processes by Groups
  – Set breakpoints, step, play, stop etc. using user-defined groups
  – Mutates to scalable groups view
  – Compact group representations
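
The idea behind a parallel stack view and compact group representations can be sketched as follows. This is an illustration under stated assumptions, not DDT's code: the function names, the stack contents, and the "0-4,6-7" range notation are all made up for the example. Ranks with identical call stacks are grouped, and the smallest groups are listed first because they are the most likely rogue processes.

```python
from collections import defaultdict


def group_by_stack(stacks):
    """Map each distinct call stack to the sorted ranks that share it."""
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def compact_ranks(ranks):
    """Render sorted ranks as ranges, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = None
    for r in ranks:
        if start is None:
            start = prev = r
        elif r == prev + 1:
            prev = r                      # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = r              # begin a new run
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Hypothetical data: rank 5 is stuck in a receive while the rest wait at a barrier.
stacks = {rank: ["main", "solve", "MPI_Barrier"] for rank in range(8)}
stacks[5] = ["main", "solve", "exchange_halo", "MPI_Recv"]

groups = group_by_stack(stacks)
# Smallest groups first: they are the most likely rogue processes.
for stack, ranks in sorted(groups.items(), key=lambda item: len(item[1])):
    print(f"{compact_ranks(ranks):>8}  {' > '.join(stack)}")
```

The same grouping gives a natural unit for process control: a breakpoint or step can be applied to every rank in one group rather than rank by rank.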

Presenting Data, Usefully
• Gather from every node
  – Potentially costly if all data different
  – Easy if data mostly same
  – New ideas
    • Aggregated statistics
    • Probabilistic algorithms optimize performance, even in pathological case
• Watch this space!
  – With a fast and scalable architecture, new things become possible
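
One way to picture "aggregated statistics" is a summary that can be merged pairwise, so that it can be combined at every level of a tree instead of shipping each value to the front end. The sketch below is an assumption about the general technique, not a description of DDT's algorithm; the summary layout and the sample values are hypothetical.

```python
from collections import Counter
from functools import reduce


def summarize(value):
    """Summary of one process's value: min, max, and a count per distinct value."""
    return {"min": value, "max": value, "counts": Counter([value])}


def merge(a, b):
    """Combine two summaries; associative, so it can run at any tree node."""
    return {
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "counts": a["counts"] + b["counts"],
    }


# Hypothetical data: 1,000 processes agree on an iteration counter except two.
values = [42] * 1000
values[17] = 41
values[640] = 7

total = reduce(merge, (summarize(v) for v in values))
print(total["min"], total["max"], total["counts"].most_common())
```

When the data is mostly the same, the merged summary stays tiny. When every process holds a different value, the per-value counts grow with the number of processes, which is the costly case the slide mentions.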

The new hotness
• Hybrids are today's hottest topic
  – Technology is moving quickly: compilers, SDKs, hardware
  – NVIDIA CUDA leads in tool support
• Many lines of code need rewriting for GPUs
  – Memory hierarchy
  – Explicit data transfer between host and accelerator
  – Unusual execution model
    • Kernels, thread blocks, warps, synchronization points
  – Massively fine-grained parallel model
  – It works: 1 billion keys / second on a single GTX480
• Inevitable that we need to debug!

Debugging Options
• Old world “printf”
  – NVIDIA SDK now allows this (new), but has limitations
• Fake it: run the kernel on the host x86_64 processor
  – Languages often support targeting host CPU instead of GPU
  – Different numeric precision: different answer?
  – Different scheduling: different answer?
  – A reasonable option for some bugs
• Or run on the GPU with Allinea DDT...
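
The two "different answer?" questions can be made concrete with a small host-only experiment. This is a sketch, not a model of any particular GPU: it emulates single precision by rounding through `struct`, and the helper names are made up for the example. It shows that the same sum changes with the precision used, and that in the same precision it changes with the order of the additions, which is what a different thread schedule can alter.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Accumulate in single precision by rounding after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


values = [0.1] * 1000

# Different numeric precision: same data, same order, different answer.
double_sum = sum(values)
single_sum = sum_float32(values)
print(double_sum, single_sum)

# Different scheduling: same precision, different order of additions.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)
```

A bug that depends on either effect may not reproduce when the kernel is run on the host, which is why this option suits only some bugs.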

Introducing DDT for CUDA
• The first graphical debugger for NVIDIA CUDA
  – Simple and easy to use
  – As easy as debugging ordinary code
• All the commands you'd expect
  – Breakpoints
  – Stepping warps
  – Viewing data and thread stacks
• Plus more advanced features
  – CUDA memcheck: memory debugging for CUDA
• More to come!

CUDA Threads in DDT
• Run the code
  – Browse source
  – Set breakpoints
  – Stop at a line of CUDA code
  – Stops once for each scheduled collection of blocks
• Select a CUDA thread
  – Examine variables and shared memory
  – Step a warp

Easy to understand scale
• View all threads in parallel stack view
  – At one glance, see all GPU and CPU threads together
  – Links with thread selection
  – Pick a tree node to select one of the CUDA threads at that location
• Full MPI support
  – See GPU and CPU threads from multiple nodes

Some Common Problems
• Incorrect logic (if-statements, calculations)
  – Loop iteration to GPU thread analogy: threads identified by grid and block indexes
  – Solution: Select a thread and step with DDT; look at the local state and shared data
    • Cherry-pick important threads: start, end, a few interior points
• Kernel bounds: getting the right grids and blocks
  – Incorrect kernel thread boundaries can lead to incomplete results or crashing of the kernel
  – Solution: Bugs will often trigger “CUDA memcheck” errors, so run with DDT and CUDA memory debugging enabled
  – Solution: Use DDT's advanced multi-dimensional array viewer to look at data and find the missing indexes
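
The kernel bounds problem can be reproduced without a GPU by emulating a one-dimensional launch on the host. The sketch below is an illustration only: `launch` is a hypothetical helper, and the index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`. It shows both failure modes from the slide: too few blocks leave elements uncomputed, and enough blocks without a guard send threads past the end of the array.

```python
def launch(n, block_dim, grid_dim, guard=True):
    """Emulate a 1-D kernel launch that doubles each element of range(n).

    Returns the output list and the number of out-of-bounds accesses attempted.
    """
    data = list(range(n))
    out = [None] * n          # None marks an element no thread wrote
    out_of_bounds = 0
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx   # global thread index
            if i >= n:
                if guard:
                    continue          # the usual "if (i < n)" kernel guard
                out_of_bounds += 1    # on a GPU this would touch invalid memory
                continue
            out[i] = 2 * data[i]
    return out, out_of_bounds


n, block_dim = 10, 4

too_few = n // block_dim                      # truncating division drops the tail
enough = (n + block_dim - 1) // block_dim     # rounds up to cover every element

out, _ = launch(n, block_dim, too_few)
missing = [i for i, v in enumerate(out) if v is None]
print(missing)                                # indexes nobody computed

out, bad = launch(n, block_dim, enough, guard=False)
print(bad)                                    # accesses past the end of the array
```

The list of missing indexes is the kind of pattern an array viewer makes visible, and the unguarded accesses are the kind of error a memory checker is meant to report.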

Current Limitations
• SDK 3.0 was a big leap forward
• SDK and driver limitations
  – Only one GPU can be debugged per O/S (per physical node)
  – Cannot currently read launch failure codes (without breaking your code)
  – Only one warp can be stepped per GPU at any time
  – Cannot debug GPU part of (attach to) an already running job
• Strong partnership with NVIDIA, CAPS and others is helping to extend capabilities
  – SDK 3.1 is much better for general computation
  – SDK 3.2 adds debug support for multiple GPUs per node

Petascale Debugging: Solved.
GPU Debugging: Works.
Any Questions?
