Understanding applications with Paraver
tools@bsc.es, 2018
Our Tools
• Since 1991
• Based on traces
• Open source: http://tools.bsc.es
• Core tools
  – Paraver (paramedir) – offline trace analysis
  – Dimemas – message-passing simulator
  – Extrae – instrumentation
• Focus
  – Detail, variability, flexibility
  – Behavioral structure vs. syntactic structure
  – Intelligence: performance analytics
Paraver
Paraver – Performance data browser
• Trace visualization/analysis
  – Raw data + trace manipulation
• Timelines
• Goal = flexibility
  – No semantics
  – Programmable
• 2D/3D tables (statistics)
• Comparative analyses
  – Multiple traces
  – Synchronized scales
From timelines to tables
(Figure panels: useful duration timeline, MPI calls profile, MPI calls histogram, useful-duration histogram)
Analyzing variability
(Figure panels: useful duration, IPC, instructions, L2 miss ratio)
Analyzing variability
• By the way: six months later…
(Figure panels: useful duration, IPC, instructions, L2 miss ratio)
From tables to timelines
• CESM: 16 processes, 2 simulated days
• The histogram of useful computation duration shows high variability
• How is it distributed?
• Dynamic imbalance
  – In space and time
  – Day and night
  – Seasons?
Trace manipulation
• Data handling/summarization capability
• Filtering
  – Subset of records in the original trace, selected by duration, type, value, …
  – The filtered trace IS a Paraver trace and can be analysed with the same cfgs (as long as the needed data are kept)
• Cutting
  – All records in a given time interval
  – Only some processes
• Software counters
  – Summarized values computed from those in the original trace, emitted as new event types
  – #MPI calls, total hardware counts, …
• Example: WRF-NMM, Peninsula 4 km, 128 procs, MPI + HWC – original trace 570 s / 2.2 GB; filtered trace 570 s / 5 MB; cut 4.6 s / 36.5 MB
Extrae
Extrae features
• Platforms
  – Intel, Cray, BlueGene, MIC, ARM, Android, Fujitsu Sparc, …
• Parallel programming models
  – MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Java, Python, …
• Performance counters
  – Using the PAPI interface
• Link to source code
  – Callstack at MPI routines
  – OpenMP outlined routines
  – Selected user functions (Dyninst)
• Periodic sampling
• User events (Extrae API) – see the sketch below
• No need to recompile / relink!
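A minimal C sketch of the user-events API mentioned above, assuming the extrae_user_events.h header and the Extrae_event() entry point; the event type/value numbers are arbitrary choices for this example, so check them against your Extrae installation:

    /* Mark a code region with Extrae user events (hypothetical example). */
    #include "extrae_user_events.h"

    #define REGION_TYPE  6000019   /* arbitrary event type for this sketch */
    #define SOLVER_PHASE 1         /* event value identifying the region */

    void solver_step(void)
    {
        Extrae_event(REGION_TYPE, SOLVER_PHASE);   /* region entry */
        /* ... computation to be highlighted in the Paraver timeline ... */
        Extrae_event(REGION_TYPE, 0);              /* region exit: value 0 = end */
    }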
Extrae overheads (average values)
• Event: 150 – 200 ns
• Event + PAPI: 750 ns – 1.5 µs
• Event + callstack (1 level): 1 µs
• Event + callstack (6 levels): 2 µs
How does Extrae work?
• Symbol substitution through LD_PRELOAD (recommended)
  – Specific libraries for each combination of runtimes: MPI, OpenMP, OpenMP+MPI, …
• Dynamic instrumentation
  – Based on Dyninst (developed by U. Wisconsin / U. Maryland)
  – Instrumentation in memory
  – Binary rewriting
• Alternatives
  – Static link (i.e., PMPI, Extrae API) – see the wrapper sketch below
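To make the interposition idea concrete, here is a minimal C sketch of an MPI_Send wrapper in the PMPI style; Extrae's real wrappers are generated per runtime combination, and the record_event() helper below is hypothetical, standing in for the actual trace-record machinery:

    #include <mpi.h>

    /* Hypothetical stand-in for writing an Extrae-style trace record. */
    static void record_event(const char *what) { (void)what; }

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        record_event("MPI_Send enter");                              /* event before the call */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);   /* forward to the real MPI */
        record_event("MPI_Send exit");                               /* event after the call */
        return rc;
    }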
Extrae XML configuration
• Trace the MPI calls (what is the program doing?)
• Trace the call stack (where in my code?)

  <mpi enabled="yes">
    <counters enabled="yes" />
  </mpi>
  <openmp enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </openmp>
  <pthread enabled="no">
    <locks enabled="no" />
    <counters enabled="yes" />
  </pthread>
  <callers enabled="yes">
    <mpi enabled="yes">1-3</mpi>
    <sampling enabled="no">1-5</sampling>
  </callers>
Extrae XML configuration (II)
• Select which HW counters are measured (how is the machine doing?)

  <counters enabled="yes">
    <cpu enabled="yes" starting-set-distribution="1">
      <set enabled="yes" domain="all" changeat-time="500000us">
        PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L2_DCM, PAPI_L3_TCM
      </set>
      <set enabled="yes" domain="all" changeat-time="500000us">
        PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP, PAPI_BR_UCN, PAPI_BR_CN, RESOURCE_STALLS
      </set>
      <set enabled="yes" domain="all" changeat-time="500000us">
        PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_VEC_DP, PAPI_VEC_SP, PAPI_FP_INS
      </set>
      <set enabled="yes" domain="all" changeat-time="500000us">
        PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_LD_INS, PAPI_SR_INS
      </set>
      <set enabled="yes" domain="all" changeat-time="500000us">
        PAPI_TOT_INS, PAPI_TOT_CYC, RESOURCE_STALLS:LOAD, RESOURCE_STALLS:STORE,
        RESOURCE_STALLS:ROB_FULL, RESOURCE_STALLS:RS_FULL
      </set>
    </cpu>
    <network enabled="no" />
    <resource-usage enabled="no" />
    <memory-usage enabled="no" />
  </counters>
Extrae XML configuration (III)
• Trace buffer size (flush/memory trade-off)
• Enable sampling (want more details?)
• Automatic post-processing to generate the Paraver trace

  <buffer enabled="yes">
    <size enabled="yes">500000</size>
    <circular enabled="no" />
  </buffer>
  <sampling enabled="no" type="default" period="50m" variability="10m" />
  <merge enabled="yes"
         synchronization="default"
         tree-fan-out="16"
         max-memory="512"
         joint-states="yes"
         keep-mpits="yes"
         sort-addresses="yes"
         overwrite="yes"
  >
    $TRACE_NAME$
  </merge>
Dimemas
Dimemas – coarse-grain, trace-driven simulation
• Simulation: highly non-linear model (MPI protocols, resource contention, …) – see the transfer-time sketch below
• Parametric sweeps
  – On abstract architectures
  – On application computational regions
• What-if analysis
  – Ideal machine (instantaneous network)
  – Estimating the impact of ports to MPI+OpenMP/CUDA/…
  – Should I use asynchronous communications?
  – Are all parts equally sensitive to the network?
• MPI sanity check: modeling nominal behavior
• Paraver – Dimemas tandem
  – Analysis and prediction
  – What-if from a selected time window
  – Detailed feedback on the simulation (trace)
(Figures: abstract target architecture of nodes with CPUs, local memory, links L and buses B; efficiency vs. bandwidth chart, "Impact of BW (L=8; B=0)", for NMM/ARW at 128–512 processes.)
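A rough LaTeX sketch of the kind of linear transfer-time model such a trace-driven simulator uses, in illustrative notation (not Dimemas' exact parameter names):

    % Point-to-point transfer time: latency plus size over bandwidth,
    % plus a possible wait while the finite buses/links are busy.
    \[
      T_{p2p} \;=\; L \;+\; \frac{S}{BW} \;+\; T_{\mathrm{wait\ for\ bus/link}}
    \]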
Network sensitivity
• MPIRE, 32 tasks, no network contention (all windows at the same scale)
(Timelines: L = 5 µs, BW = 1 GB/s; L = 1000 µs, BW = 1 GB/s; L = 5 µs, BW = 100 MB/s)
Network sensitivity
• WRF, Iberia 4 km, 4 procs/node
• Not sensitive to latency
• NMM
  – BW – 256 MB/s
  – 512 processes – sensitive to contention
• ARW
  – BW – 1 GB/s
  – Sensitive to contention
(Charts: "Impact of latency (BW=256; B=0)" – speedup vs. nominal latency; "Contention impact (L=8; BW=256)" – speedup vs. full connectivity; "Impact of BW (L=8; B=0)" – efficiency; each for NMM/ARW at 128, 256 and 512 processes.)
Would I benefit from asynchronous communications?
• SPECFEM3D (courtesy of Dimitri Komatitsch)
(Timelines: real run, ideal prediction, and predictions for MN, 100 MB/s, 10 MB/s, 5 MB/s and 1 MB/s)
Ideal machine
• The impossible machine: BW = ∞, L = 0
• Actually describes/characterizes the intrinsic application behavior
  – Load balance problems?
  – Dependence problems?
• GADGET @ Nehalem cluster, 256 processes: real run vs. ideal network (MPI calls: alltoall, allgather + allreduce, sendrecv, waitall)
• Impact on practical machines?
Impact of architectural parameters
• Ideally speeding up ALL the computation bursts by the CPUratio factor
• The more processes, the lower the speedup (higher impact of bandwidth limitations)!!
• GADGET (charts: speedup vs. CPUratio and bandwidth for 64, 128 and 256 processes)
Hybrid parallelization
• Hybrid/accelerator parallelization
• Speed up SELECTED regions by the CPUratio factor
• 128 procs
• Profile: % of computation time per code region (13 regions)
• Charts: speedup vs. CPUratio and bandwidth for selected regions covering 99.11%, 93.67% and 97.49% of the computation time (see the Amdahl sketch below)
• (Previous slide: speedups up to 100x)
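A back-of-the-envelope reading of these region-limited speedups, using Amdahl's law (my notation, not taken from the slides): if a fraction f of the computation time lies in the selected regions and their bursts are divided by the CPUratio factor r, then

    \[
      S(r) \;=\; \frac{1}{(1-f) + \frac{f}{r}}
      \qquad\Longrightarrow\qquad
      S(\infty) \;=\; \frac{1}{1-f}
    \]
    % e.g. f = 0.9367 caps the speedup at about 15.8x, far below the
    % ~100x attainable when ALL computation bursts are accelerated.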
Efficiency Models
Parallel efficiency model
• Do not blame MPI
• (Diagram: computation vs. communication timeline with MPI_Send/MPI_Recv, annotated with the LB and Comm components)
• Parallel efficiency = LB eff * Comm eff
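A sketch of the usual trace-based definitions behind this factorization, assuming T_i is the useful computation time of process i (of n) and T the elapsed time of the parallel run:

    \[
      \eta_{\mathrm{parallel}}
        \;=\; \underbrace{\frac{\tfrac{1}{n}\sum_i T_i}{\max_i T_i}}_{\mathrm{LB\ eff}}
        \;\times\;
        \underbrace{\frac{\max_i T_i}{T}}_{\mathrm{Comm\ eff}}
        \;=\; \frac{\tfrac{1}{n}\sum_i T_i}{T}
    \]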