S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks
Dr. Guido Juckeland (Head, Computational Science Dept.) / Robert Henschel (Director of Science Community Tools at Indiana University)
www.hzdr.de
Agenda: What to expect from the next 80 minutes
- Motivation
- Generating profiles and trace files with Score-P
- Visualizing trace files with Vampir
- Looking into Deep Learning Frameworks
2
Disclaimer: It's extremely easy to waste performance
- Poor/no GPU usage (80-90%)
- Bad MPI (50-90%)
- Total: 1% of peak (or worse)
Performance tools will not “automagically” make your code faster; they just point to “areas of interest”.
3
Motivation Performance Tuning 101 4
Profiling vs. Tracing: Preserving the details
[Figure: Statistics (profile): number of invocations and execution time per function (main, foo, bar). Timelines (trace): function activity over time (main, foo, bar, foo, main, foo, bar, foo).]
5
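The difference between the two views can be sketched with a toy event log. This is only an illustration, not Score-P output: the log format, the timestamps, and the call sequence (modeled loosely on the main/foo/bar timeline of the slide) are assumptions, and the summary assumes no recursion.

```shell
# Hypothetical event log: "<timestamp> <enter|leave> <function>".
trace='0 enter main
1 enter foo
2 enter bar
3 leave bar
4 leave foo
5 enter foo
6 enter bar
7 leave bar
8 leave foo
9 leave main'

# Tracing keeps every event: the timeline is the log itself.
printf '%s\n' "$trace"

# Profiling summarizes the same events per function:
# name, number of invocations, inclusive time (assumes no recursion).
profile=$(printf '%s\n' "$trace" | awk '
    $2 == "enter" { calls[$3]++; start[$3] = $1 }
    $2 == "leave" { incl[$3] += $1 - start[$3] }
    END { for (f in calls) print f, calls[f], incl[f] }
' | sort)

printf '%s\n' "$profile"
```

The profile is much smaller than the trace, but the order of calls can no longer be recovered from it.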
Sampling: Periodic observations of your application (Pull)
[Figure: timeline of main, foo, and bar with measurements taken at fixed intervals t1 to t9.]
- Running program is periodically interrupted to take measurement
- Statistical inference of program behavior
- Not very detailed information on highly volatile metrics
- Requires long-running applications
- Works with unmodified executables
6
Instrumentation: Modify application to deliver information (Push)
[Figure: timeline of main, foo, and bar with measurements t1 to t14 taken at every function entry and exit.]
- Measurement code is inserted such that every event of interest is captured directly
- Advantage: Much more detailed information
- Disadvantage: Processing of source-code / executable necessary
- Disadvantage: Large relative overheads for small functions
7
Sampling vs. Tracing: Comparing both approaches visually
[Figure: the same call sequence (main calling calculate three times, each calling add and f) recorded into the trace buffer twice: once with function instrumentation, once with sampling.]
8
Sampling + Instrumentation: Combining the best of both worlds
[Figure: call sequence (main, calculate, add, f) with three samples taken in user code and MPI calls recorded as events in the trace buffer.]
- Long running applications:
  - Requires large buffers or heavy filtering
  - Creating a filter requires runs in advance
- Codes with many small functions (e.g.: C++):
  - Function instrumentation a challenge
- Score-P: Sampling+Tracing
9
Terms and How They Relate: Making sure we use the same words
Analysis Layer | Analysis Technique
Data Acquisition | Sampling | Event-based Instrumentation
Data Recording | Summarization (Profiling) | Logging (Tracing)
Data Presentation | Profiles (Profiling) | Timelines (Tracing)
10
Summary Making the “right” choices 11
Generating Traces and Profiles with Score-P 12
Overall workflow: Recording and studying performance data
[Figure: Score-P attached to the application records performance data (trace) for visualization.]
- Attach Score-P to application
- Run with attached monitor ==> trace/profile data
- Study trace with Vampir / profile with Cube
- Repeat to:
  - Adapt instrumentation (“what you measure”)
  - Evaluate result of a change
13
Attaching Score-P, a.k.a. instrumenting your source code
Prefix each compiler command with scorep:
CC = pgcc      ->  CC = scorep <options> pgcc
CXX = pgCC     ->  CXX = scorep <options> pgCC
F90 = pgf90    ->  F90 = scorep <options> pgf90
MPICC = mpicc  ->  MPICC = scorep <options> mpicc
NVCC = nvcc    ->  NVCC = scorep <options> nvcc
$ scorep --help
This is the Score-P instrumentation tool. The usage is:
scorep <options> <original command>
Common options are:
...
--instrument-filter=<file>  Specifies the filter file for filtering functions during compile-time. It applies the same syntax, as the one used by Score-P during run-time.
--user  Enables user instrumentation.
14
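A minimal sketch of the compiler prefixing in a build script. The source file `solver.c` is a hypothetical placeholder, `--user` is the option from the help text above, and the actual compile step only runs if both tools and the file are present.

```shell
# Original compiler and the Score-P options to prepend.
CC="pgcc"
SCOREP_OPTS="--user"    # enables user instrumentation (see `scorep --help`)

# The instrumented compiler command is the original one prefixed with scorep.
CC_INSTRUMENTED="scorep $SCOREP_OPTS $CC"
echo "compile command: $CC_INSTRUMENTED -c solver.c -o solver.o"

# solver.c is a hypothetical source file; compile only if everything exists.
if command -v scorep >/dev/null 2>&1 && command -v "$CC" >/dev/null 2>&1 \
   && [ -f solver.c ]; then
    $CC_INSTRUMENTED -c solver.c -o solver.o
fi
```

With make-based builds the same effect can often be had without editing the Makefile, for example `make CC="scorep pgcc"`, provided the Makefile honors `CC`.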
Attaching Score-P: Instrument once, change measurement via runtime variables
$ scorep-info config-vars --full
SCOREP_ENABLE_PROFILING
[...]
SCOREP_ENABLE_TRACING
[...]
SCOREP_TOTAL_MEMORY
Description: Total memory in bytes for the measurement system
[...]
SCOREP_EXPERIMENT_DIRECTORY
Description: Name of the experiment directory
[...]
Profiling Example:
$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ mpirun <instrumented binary>
15
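For comparison with the profiling example, a tracing run of the same instrumented binary might simply flip the two switches. This is a sketch: the directory name `trace` and the binary name `./app_instrumented` are hypothetical, and the launch is skipped when they are not available.

```shell
# Same binary, different measurement: only the environment changes.
export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_EXPERIMENT_DIRECTORY=trace   # hypothetical directory name

# ./app_instrumented is a placeholder for the instrumented binary.
if command -v mpirun >/dev/null 2>&1 && [ -x ./app_instrumented ]; then
    mpirun ./app_instrumented
fi
```

No rebuild is needed between the two runs, which is the point of controlling the measurement through runtime variables.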
Combined Sampling+Tracing: Available since Score-P 2.0
$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_ENABLE_UNWINDING=true
$ export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000
User code is sampled (pull).
Runtime libraries with tracing support use events (push):
- MPI
- OpenMP / OpenACC / pthreads
- CUDA / OpenCL
- I/O
16
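To get a feeling for the sampling setting above, the number after `@` can be converted into a sampling rate. The sketch assumes that the value is a period counted in CPU cycles, as the `perf_cycles` name suggests, and that the core runs at a fixed 2 GHz; the real clock rate of your system will differ.

```shell
period_cycles=2000000        # value after '@' in SCOREP_SAMPLING_EVENTS
clock_hz=2000000000          # assumption: 2 GHz core clock

# One sample every period_cycles cycles.
samples_per_second=$(( clock_hz / period_cycles ))
interval_us=$(( period_cycles * 1000000 / clock_hz ))

echo "about $samples_per_second samples per second per thread"
echo "about one sample every $interval_us microseconds"
```

Under these assumptions the application is interrupted roughly once per millisecond, so functions that run for only a few microseconds will rarely be observed individually.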
Things to look at: What can Score-P record?
[Figure: Application + Score-P, run on HPC system, results in performance measurement (Profile/Trace).]
User Functions:
− C/C++/Fortran
− Sampling *NEW*
− Custom regions
− Java
− Python (*Experimental*)
Parallel Paradigms:
− MPI
− Pthreads
− OpenMP
− XeonPhi Native *NEW*
− CUDA
− OpenACC/OpenCL *NEW*
− OpenShmem (+Cray)
Hardware:
− Performance counters (PAPI)
− Plugin counters
Operating System:
− Resource usage
− I/O (*Experimental*)
17
GPU Tracing Example: CUDA and OpenACC
$ export SCOREP_ENABLE_TRACING=yes
$ export SCOREP_TIMER=clock_gettime
$ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy,flushatexit
$ export SCOREP_OPENACC_ENABLE=yes
$ export ACC_PROFLIB=$SCOREP_LIB/libscorep_adapter_openacc_event.so
- Can be used in combination
- Also supports CUPTI counters
18
Limitations: Why tracing is hard
[Figure: Score-P attached to the application writes performance data (trace) for visualization.]
- Temporarily stored in main memory
- Adds overhead at runtime
- Limited size
=> Overhead must be low for meaningful performance analysis
Event tracing requires trade-offs:
- Only add the data sources you need
- Limit granularity (i.e., filtering)
Score-P is a profiling experiment
19
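Limiting granularity through filtering could look like the following sketch. The file name `scorep.filt` is arbitrary, and the excluded region names `add` and `f` are the small functions from the earlier example figure, used here purely as placeholders; which regions to exclude in a real code has to come from a prior profiling run.

```shell
# Write a Score-P filter file that drops two small, frequently called functions.
cat > scorep.filt <<'EOF'
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE add f
SCOREP_REGION_NAMES_END
EOF

# Apply the filter at run-time for subsequent measurements.
export SCOREP_FILTERING_FILE=scorep.filt

# If a profile from an earlier run exists, estimate the effect of the filter
# on the trace size before starting the (expensive) tracing run.
if command -v scorep-score >/dev/null 2>&1 && [ -f profile/profile.cubex ]; then
    scorep-score -r -f scorep.filt profile/profile.cubex
fi
```

The same file can be passed at compile-time via `--instrument-filter=<file>`, which avoids inserting the measurement code in the first place.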
DEMO: Generating Traces and Profiles with Score-P 20
Visualizing Profiles with CUBE, Traces with Vampir 21
Bringing it all together: Score-P + Analysis Tools
[Figure: layered architecture, top to bottom.]
Analysis tools: Vampir, Scalasca, CUBE, TAU, Periscope, TAUdb
Interfaces: Event traces (OTF2); Call-path profiles (CUBE4, TAU); Online interface; Hardware counter (PAPI, rusage)
Score-P measurement infrastructure
Instrumentation wrapper:
- Process-level parallelism (MPI, SHMEM)
- Thread-level parallelism (OpenMP, Pthreads)
- Accelerator-based parallelism (CUDA, OpenCL, OpenACC)
- Source code instrumentation
- User instrumentation
Application
22
CUBE: Interactive profile analysis
- What kind of performance metric?
- Where is it in the source code? In what context?
- How is it distributed across the processes/threads?
23
Vampir: Interactive trace analysis
[Screenshot annotations: >50% time wasted; large imbalance instantly visible.]
24
Vampir: Performance data visualization in a complex environment
[Figure: I/O system, compute nodes (batch jobs), login nodes, and desktop system; the cores of the compute nodes write the trace file (OTF2).]
25
Simplest Approach: Use your desktop system
[Figure: I/O system, compute nodes (batch jobs), login nodes, and desktop system; the trace file (OTF2) is copied to the desktop, where visualization and analysis run in Vampir.]
+ Minimal setup (no installations, no batch job)
- Copying of traces to desktop
- Only small traces
26
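Since this approach only works for small traces, it can help to check the size of the experiment directory before copying it. The helper below is a hypothetical sketch: the function name, the 1 GiB default threshold, the host name, and the paths are all assumptions, not limits or conventions defined by Vampir.

```shell
# Hypothetical helper: succeeds if the directory is at most limit_kib KiB.
# The 1 GiB default is an arbitrary assumption, not a documented Vampir limit.
trace_fits_desktop() {
    dir=$1
    limit_kib=${2:-1048576}
    size_kib=$(du -sk "$dir" | awk '{ print $1 }')
    [ "$size_kib" -le "$limit_kib" ]
}

# Example usage (host name and paths are placeholders):
#   on the cluster:  trace_fits_desktop trace && echo "small enough to copy"
#   on the desktop:  scp -r cluster:/path/to/trace .
#                    vampir trace/traces.otf2
```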