VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING

Score-P – A Joint Performance Measurement Run-Time Infrastructure
VI-HPS Team
Score-P
Infrastructure for instrumentation and performance measurements.
An instrumented application can be used to produce several kinds of results:
- Call-path profiling: CUBE4 data format used for data exchange
- Event-based tracing: OTF2 data format used for data exchange
- Online profiling: in conjunction with the Periscope Tuning Framework
Supported parallel paradigms:
- Multi-process: MPI, SHMEM
- Thread-parallel: OpenMP, Pthreads
- Accelerator-based: CUDA, OpenCL
Open source; portable and scalable to all major HPC systems.
Initial project funded by BMBF; close collaboration with the PRIMA project funded by DOE.
Architecture overview
[Layer diagram of the Score-P software stack:]
- Analysis tools on top: Vampir, Scalasca, CUBE, TAU, Periscope, TAUdb
- Data interfaces: event traces (OTF2), call-path profiles (CUBE4, TAU), online interface
- Score-P measurement infrastructure, with hardware-counter access (PAPI, rusage)
- Instrumentation wrappers: process-level parallelism (MPI, SHMEM), thread-level parallelism (OpenMP, Pthreads), accelerator-based parallelism (CUDA, OpenCL), source-code instrumentation, user instrumentation
- Application at the bottom of the stack
Partners
- Forschungszentrum Jülich, Germany
- German Research School for Simulation Sciences, Aachen, Germany
- Gesellschaft für numerische Simulation mbH, Braunschweig, Germany
- RWTH Aachen, Germany
- Technische Universität Darmstadt, Germany
- Technische Universität Dresden, Germany
- Technische Universität München, Germany
- University of Oregon, Eugene, USA
Hands-on: NPB-MZ-MPI / BT
Performance analysis steps
- Reference preparation for validation
- Program instrumentation
- Summary measurement collection
- Summary experiment scoring
- Summary measurement collection with filtering
- Summary analysis report examination
- Event trace collection
- Event trace examination & analysis
NPB-MZ-MPI / BT instrumentation
Start in the tutorial directory again and clean up the previous build:

  % cd ..
  % make clean
NPB-MZ-MPI / BT instrumentation
Edit config/make.def to adjust the build configuration: modify the compiler/linker specification (MPIF77) and uncomment the Score-P compiler wrapper specification.

  #---------------------------------------------------------------
  # SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS
  #---------------------------------------------------------------
  # Items in this file may need to be changed for each platform.
  #---------------------------------------------------------------
  COMPFLAGS = -fopenmp
  ...
  #---------------------------------------------------------------
  # The Fortran compiler used for MPI programs
  #---------------------------------------------------------------
  #MPIF77 = mpif77

  # Score-P variant to perform instrumentation
  MPIF77 = scorep mpif77
  ...
  # This links MPI Fortran programs; usually the same as ${MPIF77}
  FLINK = $(MPIF77)
  ...
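The same prefix pattern works for other build systems as well; the following is only a hypothetical Makefile sketch (the variable names besides MPIF77/FLINK are illustrative and not part of the NPB build), showing that prepending "scorep" to the existing compiler command is all that is needed:

  # Hypothetical Makefile fragment: prepend "scorep" to the usual compiler commands.
  MPIF77 = scorep mpif77    # Fortran MPI wrapper, as in make.def above
  MPICC  = scorep mpicc     # same pattern would apply to C sources
  FLINK  = $(MPIF77)        # the link step must also go through the wrapper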
NPB-MZ-MPI / BT instrumented build
Return to the root directory and re-build the executable using the Score-P compiler wrapper:

  % make bt-mz CLASS=W NPROCS=4
  cd BT-MZ; make CLASS=W NPROCS=4 VERSION=
  make: Entering directory 'BT-MZ'
  cd ../sys; cc -o setparams setparams.c -lm
  ../sys/setparams bt-mz 4 W
  scorep mpif77 -c -O3 -fopenmp bt.f
  [...]
  cd ../common; scorep mpif77 -c -O3 -fopenmp timers.f
  scorep mpif77 -O3 -fopenmp -o ../bin.scorep/bt-mz_W.4 \
    bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o \
    adi.o rhs.o zone_setup.o x_solve.o y_solve.o exch_qbc.o \
    solve_subs.o z_solve.o add.o error.o verify.o mpi_setup.o \
    ../common/print_results.o ../common/timers.o
  Built executable ../bin.scorep/bt-mz_W.4
  make: Leaving directory 'BT-MZ'
Measurement configuration: scorep-info
Score-P measurements are configured via environment variables:

  % scorep-info config-vars --full
  SCOREP_ENABLE_PROFILING
    Description: Enable profiling
    [...]
  SCOREP_ENABLE_TRACING
    Description: Enable tracing
    [...]
  SCOREP_TOTAL_MEMORY
    Description: Total memory in bytes for the measurement system
    [...]
  SCOREP_EXPERIMENT_DIRECTORY
    Description: Name of the experiment directory
    [...]
  SCOREP_FILTERING_FILE
    Description: A file name which contains the filter rules
    [...]
  SCOREP_METRIC_PAPI
    Description: PAPI metric names to measure
    [...]
  SCOREP_METRIC_RUSAGE
    Description: Resource usage metric names to measure
  [... More configuration variables ...]
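A minimal sketch of how such a configuration is typically set up in the shell before launching a run (the variable names are taken from the list above; the values are only examples, not recommendations from this slide):

  % export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum   # where results are written
  % export SCOREP_ENABLE_PROFILING=true                         # collect a call-path profile
  % export SCOREP_ENABLE_TRACING=false                          # no event trace yet
  % export SCOREP_TOTAL_MEMORY=16M                              # per-process measurement memory
  % export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_OPS          # optional hardware counters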
NPB-MZ-MPI / BT summary measurement collection
Change to the directory containing the new executable, then run the instrumented application with the desired configuration:

  % cd bin.scorep
  % export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum
  % OMP_NUM_THREADS=4 mpirun -np 4 ./bt-mz_W.4

  NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark

  Number of zones:   4 x   4
  Iterations: 200    dt:   0.000800
  Number of active processes:   4

  Use the default load factors with threads
  Total number of threads:   16  ( 4.0 threads/process)

  Calculated speedup =  15.78

  Time step    1
  [... More application output ...]

  BT-MZ Benchmark Completed.
  Time in seconds =   100.41
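On batch-operated systems the same steps are usually wrapped in a job script; the following is only a hypothetical sketch (scheduler directives and the launcher name depend on the local system and are not specified by this slide):

  #!/bin/sh
  # Hypothetical run script; replace mpirun with srun/aprun etc. as required locally.
  cd bin.scorep
  export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_W_4x4_sum
  export OMP_NUM_THREADS=4
  mpirun -np 4 ./bt-mz_W.4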
NPB-MZ-MPI / BT summary analysis report examination
The run creates the experiment directory, which contains:
- a record of the measurement configuration (scorep.cfg)
- the analysis report collated after measurement (profile.cubex)

  % ls
  bt-mz_W.4  scorep_bt-mz_W_4x4_sum
  % ls scorep_bt-mz_W_4x4_sum
  profile.cubex  scorep.cfg
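A minimal sketch of how these two files are typically inspected, assuming the CUBE GUI launcher (cube) and the Score-P command-line tools are on the PATH:

  % cube scorep_bt-mz_W_4x4_sum/profile.cubex          # interactive report exploration
  % scorep-score scorep_bt-mz_W_4x4_sum/profile.cubex  # textual scoring (next slides)
  % cat scorep_bt-mz_W_4x4_sum/scorep.cfg              # record of the measurement configuration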
Congratulations!?
If you made it this far, you have successfully used Score-P to:
- instrument the application,
- analyze its execution with a summary measurement, and
- examine it with one of the interactive analysis report explorer GUIs,
...revealing the call-path profile annotated with the "Time" metric, visit counts, and MPI message statistics (bytes sent/received).
...but how good was the measurement?
The measured execution produced the desired valid result, however the execution took rather longer than expected:
- even when ignoring measurement start-up/completion,
- therefore it was probably dilated by instrumentation/measurement overhead.
Performance analysis steps
- Reference preparation for validation
- Program instrumentation
- Summary measurement collection
- Summary experiment scoring
- Summary measurement collection with filtering
- Summary analysis report examination
- Event trace collection
- Event trace examination & analysis
NPB-MZ-MPI / BT summary analysis result scoring
Report scoring as textual output:

  % scorep-score scorep_bt-mz_W_4x4_sum/profile.cubex
  Estimated aggregate size of event trace:                   1025MB
  Estimated requirements for largest trace buffer (max_buf):  265MB
  Estimated memory requirements (SCOREP_TOTAL_MEMORY):        273MB
  (hint: When tracing set SCOREP_TOTAL_MEMORY=273MB to avoid intermediate flushes
   or reduce requirements using USR regions filters.)

  flt type     max_buf[B]     visits  time[s] time[%] time/visit[us] region
      ALL     277,799,918 41,157,533  1284.51   100.0          31.21 ALL
      USR     274,792,492 40,418,321   286.86    22.3           7.10 USR
      OMP       6,882,860    685,952   862.00    67.1        1256.64 OMP
      COM         371,956     45,944   112.21     8.7        2442.29 COM
      MPI         102,286      7,316    23.44     1.8        3204.09 MPI

Note the estimates: roughly 1 GB of total trace memory, i.e. 265 MB per rank!

Region/call-path classification:
- MPI: pure MPI functions
- OMP: pure OpenMP regions
- USR: user-level computation
- COM: "combined" USR + OpenMP/MPI
- ANY/ALL: aggregate of all region types
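Since the USR regions dominate the buffer estimate, they are the natural candidates for filtering. A sketch of a filter file, assuming a hypothetical file name npb.filt; the region names below are illustrative choices one might derive from the per-region score output, not values prescribed by this slide:

  # npb.filt (hypothetical): exclude small, frequently called USR routines
  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE
      binvcrhs*
      matmul_sub*
      matvec_sub*
      exact_solution*
  SCOREP_REGION_NAMES_END

The effect of such a filter can be previewed without re-running the application:

  % scorep-score -f npb.filt scorep_bt-mz_W_4x4_sum/profile.cubex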