Parallel Performance Optimization
ASD Shared Memory HPC Workshop
Computer Systems Group, ANU Research School of Computer Science
Australian National University, Canberra, Australia
February 13, 2020
Schedule - Day 4
NUMA systems

Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Non-Uniform Memory Access: Basic Ideas
- NUMA means there is some hierarchy in the main memory system's structure
- all memory is available to the programmer (single address space), but some memory takes longer to access than other memory
- modular memory systems with interconnects: UMA vs NUMA

[Figure: modular memory systems built from dual-core modules with private caches, an interconnect and memory modules, contrasting a UMA organization with a NUMA organization in which each module has its own local memory]

- on a NUMA system, there are two effects that may be important:
  - thread affinity: once a thread is assigned to a core, ensure that it stays there
  - NUMA affinity: memory used by a process must only be allocated on the socket of the core that it is bound to
Examples of NUMA Configurations
- 4-socket Opteron; note the extra NUMA level within a socket!
- Intel Xeon 5500, with QPI
(figures courtesy qdpma.com)
Examples of NUMA Configurations (II)
- 8-way 'glueless' system (processors are directly connected)
(figure courtesy qdpma.com)
Case Study: Why NUMA Matters
MetUM global atmosphere model, 1024 × 769 × 70 grid, on an NCI Intel X5570 / InfiniBand supercomputer (2011):

[Figure: Effect of Process and NUMA Affinity on Scaling]

- note the differing values for t16!
- on the X5570, local:remote memory access latency is 65:105 cycles
- this indicates a significant number of L3$ misses (only accesses that miss in cache pay the local vs. remote penalty)
Case Study: Why NUMA Matters (II)

[Figure: time breakdown with no NUMA affinity, 1024 processes (dual-socket nodes, 4 cores per socket)]

- note that the spikes in compute times always occurred in groups of 4 processes (e.g. all of socket 0)
Process and Thread Affinity
- in general, the OS is free to decide which core (virtual CPU) a process or thread runs on
- we can restrict which CPUs it will run on by specifying an affinity mask of the CPU ids it may be scheduled on
- this has 2 benefits (assuming other active processes/threads are excluded from the specified CPUs):
  - ensure maximum speed for that process/thread
  - minimize cache / TLB pollution caused by context switches
- e.g. on an 8-CPU system, create 8 threads and pin each to a different CPU:

      pthread_t threadHandle[8];
      cpu_set_t cpu;
      for (int i = 0; i < 8; i++) {
          pthread_create(&threadHandle[i], NULL, threadFunc, NULL);
          CPU_ZERO(&cpu);
          CPU_SET(i, &cpu);
          pthread_setaffinity_np(threadHandle[i], sizeof(cpu_set_t), &cpu);
      }

- for a process, it is similar: sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu);
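As a worked illustration of the process variant, here is a minimal, self-contained sketch written for these notes (not from the original slides; the choice of CPU 0 is illustrative only) that pins the calling process to a single CPU:

    /* Hedged sketch: pin the calling process to CPU 0. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t cpu;
        CPU_ZERO(&cpu);                 /* clear the affinity mask        */
        CPU_SET(0, &cpu);               /* allow scheduling on CPU 0 only */
        if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpu) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... everything from here on runs on CPU 0; threads created
           later inherit this mask unless they change it themselves ... */
        return 0;
    }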
numactl: Controlling NUMA from the Shell
- on a NUMA system, we generally wish to bind a process and its memory image to a particular 'node' (= NUMA domain)
- the NUMA API provides a way of controlling memory allocation policies on a per-node or per-process basis
- policies are default, bind, interleave and preferred
- run a program on a CPU on node 0, with all memory allocated on node 0:

      numactl --membind=0 --cpunodebind=0 ./prog -args

- similar, but force it to run on CPU 0 (which must be on node 0):

      numactl --physcpubind=0 --membind=0 ./prog -args

- optimize bandwidth for a crucial program by utilizing multiple memory controllers (at the expense of other processes!):

      numactl --interleave=all ./memhog ...

- numactl --hardware shows the available nodes etc.
libnuma: Controlling NUMA from within a Program
- with libnuma, we can similarly change the node affinity and memory allocation policy of (the current thread of) an executing process
- run from now on on a CPU on node 0, with all memory allocated on node 0, either by binding a nodemask:

      nodemask_t mask;
      nodemask_zero(&mask);
      nodemask_set(&mask, 0);
      numa_bind(&mask);

  or via:

      numa_run_on_node(0);
      numa_set_preferred(0);

- to allow it to run on all nodes again:

      numa_run_on_node_mask(&numa_all_nodes);

- execute a memory-hogging function, with all its (new) memory fully interleaved, and then restore the previous state:

      nodemask_t prevmask = numa_get_interleave_mask();
      numa_set_interleave_mask(&numa_all_nodes);
      memhog(...);
      numa_set_interleave_mask(&prevmask);
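The calls above can be combined with libnuma's allocators. Below is a minimal sketch written for these notes (the buffer size and node number are illustrative assumptions, not from the slides); it keeps both execution and data on node 0 and links with -lnuma:

    /* Sketch: run on node 0 and allocate a buffer whose pages live on node 0. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {          /* kernel without NUMA support */
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }
        numa_run_on_node(0);                 /* restrict this thread to node 0 */

        size_t n = 1UL << 23;                /* 8 Mi doubles (illustrative size) */
        double *buf = numa_alloc_onnode(n * sizeof(double), 0);
        if (buf == NULL) { perror("numa_alloc_onnode"); return 1; }

        memset(buf, 0, n * sizeof(double));  /* touch pages: physically placed on node 0 */
        /* ... compute on buf using only local memory accesses ... */

        numa_free(buf, n * sizeof(double));
        return 0;
    }

numa_alloc_onnode() and numa_free() are part of the same libnuma API as the calls on the slide above.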
Hands-on Exercise: NUMA Effects
Objective: explore the effects of Non-Uniform Memory Access (NUMA), that is, the general benefit of ensuring that a process and its memory are in the same NUMA domain.
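One possible starting point (an assumption about the exercise setup, not the workshop's supplied code) is to time the same sweep over a buffer on the local node and over one forced onto a remote node; it assumes at least two NUMA nodes and links with -lnuma:

    /* Sketch: compare local vs. remote memory access from a thread pinned to node 0. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double sweep(const double *a, size_t n)
    {
        struct timespec t0, t1;
        double sum = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fprintf(stderr, "sum=%g\n", sum);   /* keep the loop from being optimised away */
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1)
            return 1;                                 /* need at least nodes 0 and 1 */
        size_t n = (256UL * 1024 * 1024) / sizeof(double);

        numa_run_on_node(0);                          /* execute on node 0   */
        double *local  = numa_alloc_onnode(n * sizeof(double), 0);  /* same node  */
        double *remote = numa_alloc_onnode(n * sizeof(double), 1);  /* other node */
        if (!local || !remote) return 1;
        for (size_t i = 0; i < n; i++) { local[i] = 1.0; remote[i] = 1.0; }

        printf("local  sweep: %.3f s\n", sweep(local,  n));
        printf("remote sweep: %.3f s\n", sweep(remote, n));

        numa_free(local,  n * sizeof(double));
        numa_free(remote, n * sizeof(double));
        return 0;
    }

Re-running the same binary under different numactl --cpunodebind / --membind settings from the earlier slides is an alternative way to produce the same contrast.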
Profiling Codes

Outline
1 NUMA systems
2 Profiling Codes
3 Intel TBB
4 Lock Free Synchronization
5 Transactional Memory
Profiling: Basics
- Profiling is the process of recording information during execution of a program to form an aggregate view of its dynamic behaviour
- Compare with tracing, which records an ordered log of events that can be used to reconstruct dynamic behaviour
- Used to understand program performance and find bottlenecks
- At certain points in execution, record program state (instruction pointer, calling context, hardware performance counters, ...)
- Sampling (recurrent event trigger) vs. instrumentation (probes at specific points in the program)
- Real vs. simulated execution
Sampling

[Figure: a timer interrupt every 10 ms samples the CPU's program counter and hardware counters (cycles, cache misses, flops) while a program runs through functions Main, Asterix and Obelix; each sample adds to the function-table entry for the current routine and resets the counters]

When the event trigger occurs, record the instruction pointer (+ call context) and performance counters: low overhead, but subject to sampling error.

Source: EuroMPI'12: Introduction to Performance Engineering
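The mechanism can be mimicked in a few lines of C. The toy sampler below (an illustration written for these notes, not part of the EuroMPI material) arms a 10 ms SIGPROF timer and charges each tick to whichever routine the program claims to be in; a real sampler would read the interrupted program counter and hardware counters instead:

    /* Toy sampler: a 10 ms SIGPROF timer charges ticks to the current routine. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    enum { F_MAIN, F_ASTERIX, F_OBELIX, NFUNCS };
    static volatile sig_atomic_t current = F_MAIN;   /* which routine is running    */
    static volatile long ticks[NFUNCS];              /* samples charged per routine */

    static void on_tick(int sig) { (void)sig; ticks[current]++; }

    static void spin(int which, long iters)          /* stand-in for real work */
    {
        current = which;
        for (volatile long i = 0; i < iters; i++)
            ;
        current = F_MAIN;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_tick;
        sigaction(SIGPROF, &sa, NULL);

        struct itimerval it;
        it.it_interval.tv_sec  = 0;
        it.it_interval.tv_usec = 10000;              /* fire every 10 ms of CPU time */
        it.it_value = it.it_interval;
        setitimer(ITIMER_PROF, &it, NULL);

        spin(F_ASTERIX, 300000000L);                 /* Asterix does ~3x Obelix's work */
        spin(F_OBELIX,  100000000L);

        printf("samples: Main %ld  Asterix %ld  Obelix %ld\n",
               ticks[F_MAIN], ticks[F_ASTERIX], ticks[F_OBELIX]);
        return 0;
    }

Because the ticks are statistical, short or rarely executed routines may be missed entirely: the sampling error mentioned above.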
Instrumentation

[Figure: monitor(routine, location) calls inserted at the entry and exit of function Obelix read the CPU's hardware counters and update a function table with entries for Main, Asterix and Obelix]

Inject 'trampolines' into source or binary code: accurate, but higher overhead.

Source: EuroMPI'12: Introduction to Performance Engineering
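Compilers can inject the trampolines automatically. The sketch below (an assumption about tooling, not from the slides; the file name instr.c is hypothetical) relies on GCC's -finstrument-functions option, which inserts calls to __cyg_profile_func_enter / __cyg_profile_func_exit at every function entry and exit; the hooks themselves must be excluded from instrumentation to avoid recursion:

    /* Build with: gcc -O0 -finstrument-functions instr.c */
    #include <stdio.h>

    void __attribute__((no_instrument_function))
    __cyg_profile_func_enter(void *this_fn, void *call_site)
    {
        (void)call_site;
        fprintf(stderr, "enter %p\n", this_fn);   /* resolve address to a name offline */
    }

    void __attribute__((no_instrument_function))
    __cyg_profile_func_exit(void *this_fn, void *call_site)
    {
        (void)call_site;
        fprintf(stderr, "exit  %p\n", this_fn);
    }

    static void obelix(void)
    {
        /* ... work to be measured ... */
    }

    int main(void)
    {
        obelix();
        return 0;
    }

Mapping the recorded addresses back to function names (e.g. with addr2line) and accumulating per-function counts is how such raw events become the function table shown above.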
The 80/20 Rule & Life Cycle
- Programs typically spend 80% of their time in 20% of the code
- Programmers typically spend 20% of their effort to get 80% of the possible speedup
- → optimize for the common case

[Figure: performance engineering life cycle, with stages Coding, Measurement, Analysis, Ranking and Refinement (the performance analysis / program tuning loop) leading to Production]

Source: EuroMPI'12: Introduction to Performance Engineering
perf and VTune
- perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences in Linux performance measurements and presents a simple command-line interface.
- perf is based on the perf_events interface exported by recent versions of the Linux kernel.
- Intel's VTune is a commercial-grade profiling tool for complex applications, usable via the command line or a GUI.
perf Reference Material
- http://www.brendangregg.com/perf.html
- Perf for User-Space Program Analysis
- Perf Wiki
VTune Reference Material
- https://software.intel.com/en-us/intel-vtune-amplifier-xe
- Documentation URL
perf for Linux
- perf is both a kernel syscall interface and a collection of user-space tools to collect, analyze and present hardware performance counter data, either via counting or via sampling
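As an illustration of the counting side, the fragment below follows the pattern of the perf_event_open(2) manual page (a sketch for these notes, not workshop material); it counts retired user-space instructions around a region of interest:

    /* Counting with the raw perf_event syscall interface. */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;                  /* start disabled                      */
        attr.exclude_kernel = 1;            /* count user-space instructions only  */

        int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 0.0;                          /* region of interest */
        for (int i = 0; i < 1000000; i++)
            x += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) != sizeof(count))
            perror("read");
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

The perf user-space tool wraps the same interface: perf stat ./prog reports whole-program counts, while perf record followed by perf report gives a sampled profile.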