Introduction to Performance Analysis
Visualization and Analysis of Performance on Large-scale Software (VAPLS) 2013
Todd Gamblin (LLNL)
Katherine Isaacs (UC Davis, LLNL)
Why does my code run slowly?
1. Bad algorithm
   • Poor computational complexity, poor performance
2. Takes poor advantage of the machine
   • Code does not use hardware resources efficiently
   • Different code may take better or worse advantage of different types of hardware
   • Many factors can contribute: CPU, memory, threading, network, I/O
3. It just has a lot of work to do
   • Already using the best algorithm
   • Maps well to the machine
Distinguishing between these scenarios is difficult!
Profiling a single process
Profiling is one of the most fundamental forms of performance analysis.
• Measures how much time is spent in particular functions in the code
• May include calling context
• May map time to particular files/line numbers
Profiling helps programmers locate the bottleneck.
• Amdahl's law: figure out what needs to be sped up first
[Figure: screenshot from HPCToolkit showing elapsed time attributed to functions and source code]
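Tools like HPCToolkit gather this data automatically by sampling the running program. As a minimal illustration of the underlying idea, the sketch below (with a hypothetical compute_kernel standing in for whatever the profile flags as hot) measures the elapsed time of one candidate bottleneck by hand with clock_gettime.

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical hot spot -- stands in for whatever function a profile flags. */
    static double compute_kernel(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += (double)i * j;
        return sum;
    }

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double result = compute_kernel(2000);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("compute_kernel: %.3f s (result %g)\n", elapsed, result);
        return 0;
    }

A sampling profiler does the same attribution without modifying the source, and for every function at once.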
How do we measure hardware?
Modern chips offer special hardware performance counters:
• Event counters, e.g.:
   • Number of floating-point, integer, memory, etc. instructions
   • Number of L1, L2 cache misses
   • Number of pipeline stalls
   • Only so many counters can be measured at once
• Instruction sampling
   • Precise latency and memory access information
   • Operands and other metadata for particular instructions
   • Supported on newer chips
Counters provide useful diagnostic information:
• Can explain why a particular region consumes lots of time
• Generally need to attribute counters to source code or another domain first
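One common way to read these counters from user code is the PAPI library (not mentioned on this slide, so treat it as one example among several). The sketch below assumes PAPI is installed and that the preset events PAPI_TOT_INS and PAPI_L1_DCM are available on the target CPU; it counts total instructions and L1 data-cache misses around a region of interest.

    #include <stdio.h>
    #include <papi.h>   /* PAPI hardware performance counter library */

    int main(void) {
        /* Initialize PAPI and build an event set with two preset counters. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

        int evset = PAPI_NULL;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);   /* total instructions retired */
        PAPI_add_event(evset, PAPI_L1_DCM);    /* L1 data-cache misses */

        long long counts[2];
        double sum = 0.0;

        PAPI_start(evset);
        for (int i = 0; i < 10000000; i++)     /* region of interest */
            sum += (double)i;
        PAPI_stop(evset, counts);

        printf("instructions: %lld, L1 D-cache misses: %lld (sum %g)\n",
               counts[0], counts[1], sum);
        return 0;
    }

Most of the tools listed at the end of this talk read the same counters, but attribute them to source lines or call paths automatically.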
Understanding processor complexity
Once you've identified the "hot spot", how do you know what the problem is?
• Have to dig deeper into the hardware
• Understand how the code interacts with the architecture
Processors themselves are parallel:
• Multiple functional units
• Multiple instructions issued per clock cycle
• SIMD (vector) instructions
• Hyperthreading
Can your code exploit these?
[Figure: 17-core Blue Gene/Q SoC, with processor cores sharing an L2 cache]
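Whether a loop can exploit SIMD units often comes down to how it is written. The sketch below is an illustrative saxpy-style loop (not taken from the talk): declaring the pointers with restrict tells the compiler the arrays do not alias, which typically allows it to emit vector instructions at optimization levels such as gcc -O3 or clang -O3.

    #include <stddef.h>

    /* y[i] += a * x[i].
     * With 'restrict' the compiler knows x and y do not overlap, so it is
     * free to issue SIMD (vector) instructions for this loop. */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];
    }

Compiler vectorization reports (and the hardware counters above) can confirm whether the vector units are actually being used.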
Understanding memory
Threads and processes on a single node communicate through shared memory.
Memory is hierarchical:
• Many levels of cache
• Different access speeds
• Different levels of sharing in cache
[Figure: one processor with four cores, each with a private L1 cache, sharing an L2 cache and main memory]
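Because the hierarchy rewards locality, the same amount of work can run at very different speeds depending on the access pattern. The sketch below (illustrative, with an arbitrary matrix size) sums a row-major matrix twice: the row-wise traversal walks memory sequentially and reuses cache lines, while the column-wise traversal strides through memory and typically misses far more often.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 4096

    int main(void) {
        /* Row-major N x N matrix stored contiguously. */
        double *a = malloc((size_t)N * N * sizeof(double));
        for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

        double sum_row = 0.0, sum_col = 0.0;

        /* Cache-friendly: consecutive elements, one cache line serves many accesses. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum_row += a[(size_t)i * N + j];

        /* Cache-unfriendly: stride of N doubles, most accesses miss in L1/L2. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum_col += a[(size_t)i * N + j];

        printf("%g %g\n", sum_row, sum_col);
        free(a);
        return 0;
    }

Cache-miss counters (e.g., PAPI_L1_DCM above) make the difference between the two loops directly visible.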
Understanding memory
Access to main memory may not have uniform access time.
• More cores make uniform latency hard to maintain
Many modern processors have Non-Uniform Memory Access (NUMA) latency.
• Time to access memory attached to remote sockets is longer than for local memory
[Figure: 4-socket, 16-core NUMA node — each of four processors has four cores with private L1 caches and a shared L2, attached to its own local memory (Memory 0–3)]
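On Linux, pages are usually placed on the NUMA node of the thread that first touches them. The OpenMP sketch below (an illustration assuming threads are pinned to cores, e.g. with OMP_PROC_BIND=close) initializes an array with the same thread layout that later uses it, so each thread mostly reads memory local to its own socket.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 26)

    int main(void) {
        double *a = malloc((size_t)N * sizeof(double));

        /* First touch: each thread initializes the chunk it will later work on,
         * so the OS places those pages on that thread's local NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Same static schedule -> each thread mostly accesses local memory. */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %g\n", sum);
        free(a);
        return 0;
    }

If a single thread had initialized the whole array instead, all pages would sit on one node and the other sockets would pay remote-access latency.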
Modern supercomputers are composed of many processors
Tianhe-2 (China)
• 3.1 million Intel Xeon Phi cores
• Fat-tree network
Titan (ORNL)
• 560,000 AMD Opteron cores
• 18,688 NVIDIA GPUs
• 3D torus/mesh network
IBM Blue Gene/Q (LLNL)
• 1.5 million PowerPC A2 cores
• 98,000 network nodes × 16 cores
• 5D torus/mesh network
Processors pass messages over a network
Each node in the network is a multi-core processor.
Programs pass messages over the network.
Many topologies:
• Fat tree
• Cartesian (torus/mesh)
• Dragonfly
• Multiple routing options for each one!
Most recent networks have extensive performance counters:
• Measure bandwidth on links
• Measure contention on nodes
[Figures: 4D torus network topology; fat-tree network topology]
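MPI exposes Cartesian topologies directly to programs. The sketch below is a minimal illustration (not from the talk): it arranges the ranks in a 2D periodic grid, a small torus, and has each rank exchange its rank number with its neighbor along the first dimension.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Arrange ranks in a 2D periodic grid (a small torus). */
        int dims[2] = {0, 0}, periods[2] = {1, 1};
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

        int rank, left, right;
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_shift(cart, 0, 1, &left, &right);   /* neighbors in dimension 0 */

        /* Send my rank to the right neighbor, receive from the left neighbor. */
        int recvd = -1;
        MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                     &recvd, 1, MPI_INT, left, 0, cart, MPI_STATUS_IGNORE);

        printf("rank %d received %d from its left neighbor\n", rank, recvd);

        MPI_Finalize();
        return 0;
    }

How well such a logical grid maps onto the physical torus or fat tree is exactly what the network performance counters help diagnose.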
Tracing in a message-passing application
Tracing records all function calls and messages.
• Can be recorded along with hardware counters
• Produces a large volume of records
• Clocks may not be synchronized across nodes
Traces can identify causes and propagation of delays, and log the behavior of adaptive algorithms in practice.
[Figure: timeline trace view, screenshot from Vampir]
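Tracing tools such as VampirTrace typically intercept MPI calls through the PMPI profiling interface. The sketch below is a simplified illustration of that mechanism, not any tool's actual code: it wraps MPI_Send, logs a timestamped record, and forwards the call to PMPI_Send.

    #include <stdio.h>
    #include <mpi.h>

    /* Interpose on MPI_Send via the PMPI profiling interface.
     * Link this file into the application so this wrapper is resolved
     * instead of the MPI library's own MPI_Send symbol. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        int rank;
        PMPI_Comm_rank(comm, &rank);

        double t = PMPI_Wtime();
        fprintf(stderr, "[trace] t=%.6f rank=%d MPI_Send dest=%d count=%d tag=%d\n",
                t, rank, dest, count, tag);

        /* Forward to the real implementation. */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

Real tracing tools write compact binary event records instead of text, buffer them per process, and merge the per-process logs (correcting for unsynchronized clocks) after the run.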
Understanding parallel performance requires mapping hardware measurements to intuitive domains
Map hardware events to source code, data structures, etc.
• Understand why performance is bad
• Take action based on what the hardware data correlates to
Most programmers look at only a small fraction of the hardware data.
• Automated visualization and analysis could help leverage that data
Tools for collecting performance measurements
• HPCToolkit: hpctoolkit.org
• mpiP: mpip.sourceforge.net
• Open|SpeedShop: openspeedshop.org
• Paraver: www.bsc.es/computer-sciences/performance-tools/paraver
• PnMPI: scalability.llnl.gov/pnmpi/
• Scalasca: www.scalasca.org
• ScalaTrace: moss.csc.ncsu.edu/~mueller/ScalaTrace/
• TAU: www.cs.uoregon.edu/research/tau/
• VampirTrace: www.tu-dresden.de/zih/vampirtrace
…and many more!