Efficient HPC Development and Production with Allinea Tools Florent Lebeau Florent.Lebeau@arm.com 27/04/2017
Download the slides • https://goo.gl/GcNg8O
Agenda
• 09:30 - 10:00: Registration
• 10:00 - 10:30: Introduction and how to use Allinea tools on Salomon
• 10:30 - 11:00: Maximize application efficiency
• 11:00 - 12:00: Fix an application crash
• 12:00 - 13:00: Lunch Break
• 13:00 - 14:00: Optimize memory accesses
• 14:00 - 14:45: Detect memory leaks
• 14:45 - 15:00: Coffee Break
• 15:00 - 15:45: Resolve workload imbalances
• 15:45 - 16:00: Wrap-up and Q&A session
And now… Let's talk about us!
Example: Weather and Forecasting models
Building blocks for better science
• Scalability
– Enable multi-physics simulations
– Run larger, more accurate models
– Resolve ground-breaking scientific problems
• Efficiency
– Reduce wasted resources (energy…)
– Maximize science output per $
– Minimize time to result
• Simplicity
– Pro-actively and automatically detect faults
– Provide applications on various hardware
– Facilitate technical dialogue with scientists
About Allinea
• Allinea: leading toolkit for HPC application developers
• Since December 2016, Allinea has been part of ARM
– Allinea objective: continue to be the trusted leader in HPC tools across every platform
• This means:
– The same team will continue to work with you, our customers and partners, and the wider HPC community
– Being part of ARM gives us the strength to deliver on our roadmap faster
– We remain 100% committed to providing cross-platform tools for HPC
– Our engineering roadmap is aligned with upcoming architectures from every vendor
They trust Allinea
Where to find Allinea tools
• Over 65% of Top 100 HPC systems
– From small to very large tools provision
• 6 of the Top 10 HPC systems
– From 1,000 to 700,000 core tools usage
• Future leadership systems
– Millions of cores usage
Allinea Tools
• Helping maximize HPC efficiency
– Reduce HPC systems operating costs
– Resolve cutting-edge challenges
– Promote Efficiency (as opposed to Utilization)
– Transfer knowledge to HPC communities
• Helping the HPC community design the best applications
– Reach highest levels of performance and scalability
– Improve scientific code quality and accuracy
• Available at VSB:
– Forge Supercomputing – 64 tokens
– Performance Reports Supercomputing – 64 tokens
ARM HPC Tools
The mission: Enable the software ecosystem for large-scale ARM systems.
Current team of 50, from an initial team of 9 in July 2014. Based in Manchester and Warwick, UK.
Areas: Userspace Compilers, ARM Performance Libraries, Performance, Open Source, Research, Allinea Tools, HPC Tools
• New compiler technology to support and evaluate next-generation ARM architecture.
• Commercially-supported BLAS, LAPACK and FFT routines optimized for ARM-compatible microarchitectures.
• New commercial tools to deliver actionable performance improvement advice to software developers.
• Identification of issues in ARM builds of open-source packages and the upstreaming of fixes.
• Parallel debugger, profiler and performance analysis tools for HPC.
www.developer.arm.com/hpc
Roadmap Update – 7.0 December 2016
• DDT
– Intel Knights Landing high-bandwidth memory debugging
– IBM Spectrum MPI support
– Reverse connect via gateway nodes
• MAP
– PAPI metrics (advanced metrics pack)
– MPI_THREAD_MULTIPLE support (metrics on main thread only)
– IBM Spectrum MPI support
– Reverse connect via gateway nodes
– Workflow integration: export function-level performance data to CI tools (Jenkins, Bamboo etc)
• Performance Reports
– Custom metrics – add a section to your own reports
– MPI_THREAD_MULTIPLE support (metrics on main thread only)
– IBM Spectrum MPI support
– Workflow integration: export all metrics data to CI tools (Jenkins, Bamboo etc)
Maximise Application Efficiency
Building a scientific application
In your opinion, what is the most critical step?
• MODEL: Science
• ALGORITHM(S): Complexity, Parallelism, Scalability
• HIGH LEVEL CODE: Libraries, Data
• BINARY: Compilation
• APPLICATION PROFILE: Profile, Tune
[Figure: the steps are laid out along a “Criticality” axis.]
Don’t go for code optimisation first, as the profile of the application depends on the earlier steps.
“Learn” with Allinea Performance Reports
• Very simple start-up
• No source code needed
• Fully scalable, very low overhead
• Rich set of metrics
• Powerful data analysis
Allinea Performance Reports Cheat sheet
• Compile your application for production
• Prefix your usual launch command with “perf-report”
$ perf-report mpirun -n 8 ./myapp.exe arg1 arg2
• Open the result
$ cat myapp_8p_1t_YYYY-MM-DD_HH:MM.txt
$ firefox myapp_8p_1t_YYYY-MM-DD_HH:MM.html
• Specify the format or the output name
$ perf-report --output=report.csv mpirun -n 8 ./myapp.exe arg1 arg2
Getting started for the workshop
• Connect to the cluster from a terminal
$> ssh -X <username>@salomon.it4i.cz
• Retrieve the workshop archive
$> cp /home/flebeau/allinea_workshop.tar.gz .
$> tar xzvf allinea_workshop.tar.gz
$> cd allinea_workshop/
• Set the environment
$> module load iimpi PerformanceReports/6.0.6 Forge/7.0.2
$> export ALLINEA_LICENCE_FILE=/home/flebeau/Licence.11373
(only necessary to use the temporary licence for the workshop)
OR
$> . common/env.sh
Go to exercise 1
• Exercise objectives
– Generate a performance report of a simple code
– Find the best parameters to maximize the application efficiency: compilation flags, number of processes, number of nodes
• Commands to use:
$> cd allinea_workshop/1_*/c (or: cd allinea_workshop/1_*/f90)
$> make
$> qsub ./job.sub # Modify the job script accordingly
• Key Allinea commands
$> module load PerformanceReports/6.0.6
In the job script, prefix the mpirun/srun command with perf-report:
$> perf-report mpirun ./wave.exe
Fix an Application Crash
Print statement debugging
• The first debugger: print statements
– Each process prints a message or value at defined locations
– Diagnose the problem from evidence and intuition
• A long slow process
– Analogous to bisection root finding
• Broken at modest scale
– Too much output – too many log files
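As an illustration of the print-statement approach described above, here is a minimal sketch in C. The helper `format_trace` and the macro `TRACE` are hypothetical names, not part of any Allinea tool. In an MPI code the rank would typically come from `MPI_Comm_rank`; here it is a plain `int` so the sketch stays self-contained.

```c
#include <stdio.h>

/* Hypothetical helper: format "[rank R] file:line: message" into buf.
 * Returns the number of characters snprintf would have written. */
int format_trace(char *buf, size_t len, int rank,
                 const char *file, int line, const char *msg)
{
    return snprintf(buf, len, "[rank %d] %s:%d: %s", rank, file, line, msg);
}

/* Hypothetical macro: print a rank-tagged message with the current source
 * location to stderr, flushing so the output survives a later crash. */
#define TRACE(rank, msg)                                        \
    do {                                                        \
        char trace_buf_[256];                                   \
        format_trace(trace_buf_, sizeof trace_buf_, (rank),     \
                     __FILE__, __LINE__, (msg));                \
        fprintf(stderr, "%s\n", trace_buf_);                    \
        fflush(stderr);                                         \
    } while (0)
```

Placing calls such as `TRACE(rank, "before exchange")` at successive points narrows down where a process stops, one rebuild at a time, which is why the approach resembles bisection. With many processes every rank emits its own lines, so the output volume grows with the job size.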
Typical types of bugs
• BOHRBUG: Steady and dependable, I’ll be there for you.
• HEISENBUG: Oh, you are debugging? Let me hide for a sec!
• MANDELBUG: Chaos is my name and you shall fear me.
• SCHRODINBUG: I am buggy AND not buggy. How about that?
Debugging by Discipline
Debugging a problem is much easier when you can:
• Make and undo changes fearlessly
– Use source control (CVS, …)
• Track what you’ve tried so far
– Write logbooks
• Reproduce bugs with a single command
– Create and use test scripts
Debugging by Magic Any technology sufficiently advanced is indistinguishable from magic. Unpredictable, dangerous, irresistible.
Allinea Forge One Unified Solution
• A scalability issue prevents you from reaching performance goals
• Use Allinea DDT to check your code or find and fix the problem: Memory error? Deadlock?
• Observe and debug your code step by step
• Flick to Allinea MAP to check the performance
• Identify and optimise bottlenecks
Allinea DDT helps to understand
Workflow: Run with Allinea tools, Identify a problem, Gather info (Who, Where, How, Why), Fix
• Who had a rogue behaviour?
‒ Merges stacks from processes and threads
• Where did it happen?
‒ Allinea DDT leaps to source automatically
• How did it happen?
‒ Detailed error message given to the user
‒ Some faults evident instantly from source
• Why did it happen?
‒ Unique “Smart Highlighting”
‒ Sparklines comparing data across processes
Learn your spells
• Prepare the code
$ mpiicc -O0 -g myapp.c -o myapp.exe
• Start Allinea DDT in interactive mode
$ ddt mpirun ./myapp.exe arg1 arg2
• Start Allinea DDT in offline mode
$ ddt --offline --output=report.html mpirun ./myapp.exe arg1 arg2
• Use reverse connect
On the login node:
$ ddt & (or use the remote client)
In the job script to submit:
ddt --connect mpirun -n 8 ./myapp.exe arg1 arg2
Allinea Remote Client
• Install the Allinea Remote Client
Go to: http://www.allinea.com/products/downloads/
• Connect to the cluster with the remote client
Connection name: VSC
Hostname: <username>@salomon.it4i.cz
Remote Installation Directory: /apps/all/Forge/7.0.2/
Remote script: <leave blank>
Click on “Test Remote Launch”, and if it works, click on “OK”
Connect to the remote cluster through the remote client
• Connect to the cluster with a terminal and submit the job that connects back to the client
Exercise: Matrix Multiplication: C = A x B + C
[Figure: matrices A, B and C of dimension “size”, cut into nslices = 4 slices; i, j, k are the loop indexes.]
Algorithm
1- Master initialises matrices A, B & C
2- Master slices the matrices A & C, sends them to slaves
3- Master and Slaves perform the multiplication
4- Slaves send their results back to Master
5- Master writes the result Matrix C in an output file
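The slicing in the algorithm above can be sketched in plain C. This is a serial stand-in under stated assumptions, not the workshop's MPI code: the function names `multiply_slice` and `multiply_sliced` are hypothetical, the matrices are assumed square and stored row-major, `size` is assumed divisible by `nslices`, and the loop over slices replaces the send/receive steps between master and slaves.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one process on its slice: rows [row_begin, row_end) of
 * C = A x B + C, using the i, j, k loop indexes from the exercise. */
void multiply_slice(size_t size, size_t row_begin, size_t row_end,
                    const double *A, const double *B, double *C)
{
    for (size_t i = row_begin; i < row_end; ++i) {
        for (size_t j = 0; j < size; ++j) {
            double sum = C[i * size + j];
            for (size_t k = 0; k < size; ++k) {
                sum += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = sum;
        }
    }
}

/* Serial stand-in for the master/slave scheme: each iteration handles the
 * slice of rows that one process would receive in the parallel version. */
void multiply_sliced(size_t size, size_t nslices,
                     const double *A, const double *B, double *C)
{
    assert(nslices > 0 && size % nslices == 0);
    const size_t rows_per_slice = size / nslices;
    for (size_t s = 0; s < nslices; ++s) {
        multiply_slice(size, s * rows_per_slice, (s + 1) * rows_per_slice,
                       A, B, C);
    }
}
```

Each slice only reads its own rows of A and writes its own rows of C, but needs all of B. That is why steps 2 and 4 can exchange just the A and C slices.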