GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications
Reena Nair, Tony Field
What causes serialization bottlenecks?
• Resource contention
– Hardware: CPU, peripherals
– Software: locks
• Load imbalance
Fig: Execution time per thread ID
Serialization Bottlenecks – Reduced Parallelism
Fig: Timeline of four threads on four cores around a barrier, showing intervals of maximum parallelism, intervals of reduced parallelism, and blocked threads
Profilers for debugging performance issues
• There are many different sources of bottlenecks.
Fig: Separate profilers (A to D), each targeting one source of bottlenecks: memory, locks, peripherals, critical thread
GAPP – Generic Automatic Parallel Profiler
• Can identify several different types of serialization bottlenecks.
• No need to instrument the application.
• Validated on multithreaded and multi-process parallel applications written in C/C++.
• Implemented using the extended Berkeley Packet Filter (eBPF).
– Provides fast and secure kernel tracing (~4% average runtime overhead).
Harness the symptom rather than the cause
• Identify when and where reduced parallelism is exhibited.
– The number of active threads satisfies N_act <= N_min, where N_min is a tuneable threshold with a default value of N/2 and N is the total number of threads.
• Trace context switch events in the kernel.
– Retrieve the stack trace at the end of a time slice.
• Reduce overhead: retrieve stack traces only from critical time slices.
• Critical time slice: a time slice whose average active thread count is <= N_min.
Fig: Timeline of four threads on four cores reaching a barrier, annotated with time slices, active thread counts, the reduced-parallelism region, and stack traces (ST), with one stack trace (ST 2) marked as omitted
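The critical time slice test can be sketched offline in a few lines. This is a minimal illustration, not GAPP's eBPF implementation: the event format `(timestamp, thread_id, "in" | "out")`, the `Slice` record, and the function name `find_critical_slices` are all assumptions made for the example. It keeps a running time integral of the active thread count so that the average over any slice is available when the thread is switched out.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    thread: int
    start: float
    end: float
    avg_active: float  # time-weighted average of active threads during the slice
    critical: bool     # True when avg_active <= n_min


def find_critical_slices(events, total_threads, n_min=None):
    """Classify time slices from (timestamp, thread_id, "in" | "out") events.

    The event format is hypothetical; n_min defaults to N/2 as on the slide.
    """
    if n_min is None:
        n_min = total_threads / 2
    active = 0          # threads currently running
    last_t = None       # timestamp of the previous event
    area = 0.0          # integral of the active count over time
    open_slices = {}    # thread_id -> (slice start time, area at slice start)
    slices = []
    for t, thread, kind in sorted(events):
        if last_t is not None:
            area += active * (t - last_t)
        last_t = t
        if kind == "in":
            active += 1
            open_slices[thread] = (t, area)
        elif kind == "out":
            start, area_at_start = open_slices.pop(thread)
            duration = t - start
            # A zero-length slice falls back to the instantaneous count.
            avg = (area - area_at_start) / duration if duration > 0 else float(active)
            active -= 1
            slices.append(Slice(thread, start, t, avg, avg <= n_min))
    return slices
```

With four threads that all start at time 0, three of which block at time 10 while the fourth runs until time 30, only the fourth thread's slice averages two active threads and is flagged as critical.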
Are stack traces enough to identify the bottleneck?
• Stack traces retrieved at the end of a time slice point to the bottleneck code only if it happened to be executing at the end of that time slice.
Fig: Timeline of four threads on four cores with active thread counts and stack traces ST 1 to ST 4, marking a bottleneck that may be missed
Combining bottleneck code and call paths
• Periodically sample instruction pointers (IPs).
• Reject samples if N_act > N_min.
• Combine the instruction pointers and stack traces of each critical time slice.
• Each critical time slice is assigned a metric, the Criticality Metric¹ (CMetric), which takes into account the duration and degree of parallelism of the time slice.
Fig: Periodic IP samples (IP 1 to IP 13) on threads T1 to T4 across four cores; rejected samples are marked X, and retained samples are grouped with the stack trace (ST 1 to ST 4) of their critical time slice
¹ Du Bois, Kristof, et al. "Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior." ISCA '13
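A small sketch of the sample filter and a per-slice metric follows. The slide only states that CMetric accounts for duration and degree of parallelism; the formula used here (each sub-interval's duration divided by the number of active threads, in the spirit of the cited criticality-stacks work) is an assumption, as are the function names and input shapes.

```python
def cmetric(intervals):
    """Assumed CMetric of one time slice.

    intervals: iterable of (duration, active_threads) sub-intervals of the
    slice, split wherever the active thread count changes. Fewer active
    threads make the same duration weigh more.
    """
    return sum(duration / n_act for duration, n_act in intervals if n_act > 0)


def keep_samples(samples, n_min):
    """Drop instruction-pointer samples taken while N_act > N_min.

    samples: iterable of (instruction_pointer, active_threads_at_sample_time).
    """
    return [ip for ip, n_act in samples if n_act <= n_min]
```

For example, a slice that runs for 10 time units with four active threads and then 20 time units alone would score 10/4 + 20/1 = 22.5 under this assumed formula.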
Ranking Bottlenecks
• Similar call paths are merged, together with their samples and CMetric, and sorted to display potential critical call paths, functions, and lines of code, along with the CMetric of individual threads.
Critical Path 1:
deflate_slow()
<---deflate()
<---compress()
<---Compress()
Functions and lines + Frequency
deflate_slow -- 1465
deflate.c:1650 (StackTop) -- 575
deflate.c:1580 -- 354
ThreadID CMetric
25778 256130902
25779 417320962
25783 5003332502
25784 5003756997
• The critical paths, functions, and lines point to optimization opportunities; the per-thread CMetric reveals load imbalance, if any.
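The merge-and-sort step can be sketched as follows. This is an illustration under stated assumptions: "similar" call paths are taken to mean identical call paths, and the input shape `(call_path, cmetric, samples)` and the name `rank_call_paths` are hypothetical, not GAPP's actual data structures.

```python
from collections import Counter, defaultdict


def rank_call_paths(critical_slices):
    """Merge critical time slices by call path and rank by total CMetric.

    critical_slices: iterable of (call_path, cmetric, samples), where
    call_path is a sequence of function names (innermost first) and samples
    is a list of sampled locations such as "file.c:123".
    Returns [(call_path, total_cmetric, [(location, frequency), ...]), ...]
    ordered from most to least critical.
    """
    totals = defaultdict(float)
    frequencies = defaultdict(Counter)
    for call_path, cm, samples in critical_slices:
        key = tuple(call_path)          # identical paths share one entry
        totals[key] += cm
        frequencies[key].update(samples)
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(path, totals[path], frequencies[path].most_common())
            for path in ranked]
```

Each returned entry corresponds to one "Critical Path" block in the profile, with its sampled locations listed by descending frequency.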
GAPP - Evaluation
• Evaluated using applications from the Parsec-3.0 benchmark suite and two large open source projects, MySQL and Nektar++.
• All applications except Nektar++ were multithreaded; each multithreaded application was executed with 64 threads.
• Nektar++, a spectral/hp element framework that uses message passing, was executed with 16 MPI processes.
Load imbalance from thread CMetric
Multithreaded Task Parallel Application - Ferret
• Six pipeline stages; the first and last stages perform I/O with single threads.
• Initial threads per stage: Load 1, Segmentation 15, Feature extraction 15, Indexing 15, Ranking 15, Out 1
Fig: Ferret pipeline stages with initial thread allocation
Critical Path 1:
emd ()
<---sdist_emd ()
<---raw_query ()
<---cass_table_query ()
<---t_rank ()
<---start_thread ()
Functions and lines + Frequency
isOptimal -- 41314
emd.c:422 -- 20813
emd.c:423 -- 10760
emd.c:420 -- 6657
findBasicVariables -- 41301
emd.c:350 -- 7366
emd.c:353 -- 6713
emd.c:383 -- 5827
Fig: GAPP Profile for Ferret
Optimizing Ferret by thread reallocation
• The ranking phase exhibited a higher CMetric than the other stages.
• Optimized by reallocating threads to the ranking phase.
• Thread allocations and run times:
– 15-15-15-15 (initial allocation): 30s
– 15-5-15-25: 20s
– 2-1-18-39 (after optimization): 15s
Fig: CMetric for different thread allocations - Ferret (CMetric values against thread index)
Resource Contention – MySQL
Sysbench OLTP_read_write workload
Critical Path 1 (Disk I/O):
fil_flush()[mysqld]
<---log_write_up_to()
<---trx_commit_complete_for_mysql()
<---innobase_commit()
<---ha_commit_low()
<---TC_LOG_DUMMY::commit()
<---ha_commit_trans()
<---trans_commit()
<---mysql_execute_command()
<---Prepared_statement::execute()
Functions and lines + frequency
pfs_os_file_flush_func -- 1462
os0file.ic:507 (StackTop) -- 1462
Critical Path 2 (Spin-wait loop):
sync_array_reserve_cell()
<---rw_lock_s_lock_spin()
<---pfs_rw_lock_s_lock_func()
<---row_search_mvcc()
<---ha_innobase::index_read()
<---handler::ha_index_read_idx_map()
<---join_read_const_table()
<---JOIN::extract_func_dependent_tables()
<---JOIN::make_join_plan()
<---JOIN::optimize()
Functions and lines + frequency
sync_array_reserve_cell() -- 469
sync0arr.cc:389 (StackTop) -- 469
Optimizing MySQL
Critical Function 1 (Hardware Resource Contention)
• pfs_os_file_flush_func()
– Invoked by InnoDB; flushes write buffers to disk.
– Increasing the buffer size improved transaction rate by 19% and reduced latency by 16%.
Critical Function 2 (Software Resource Contention)
• sync_array_reserve_cell()
– Invoked from a custom-built spin lock that blocks after spinning for a predefined time.
– Increasing the spin wait time reduced cache misses by 10.6%.
These 2 modifications cumulatively improved query transaction rate by 34% and reduced average latency by 25%.
Bodytrack – Parsec3.0
Multithreaded application that follows the producer-consumer paradigm
Fig: Main producer loop with four steps: Update Images (read the next set of images from the queue filled by the AsyncIO thread), Estimate (send command to the pool of worker threads), WritePose, and OutputBMP
Critical Call Path 1:
void FlexDownSample2 ()
<---TrackingModel::OutputBMP()
<---mainPthreads()
<---main ()
• Delegating OutputBMP to a writer thread improved performance by 22%.
GAPP on MPI Applications
• Nektar++: a spectral/hp element framework that implements several PDE solvers.
• Evaluated using the Incompressible Navier-Stokes Solver with 16 MPI processes.
• Load imbalance was found to be due to non-uniform partitioning of the mesh.
Fig: CMetric of individual processes (normalised CMetric against task ID, tasks 1 to 16)
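A per-process comparison like the one in the figure can be sketched as below. The slide does not say how the CMetric was normalised; dividing by the smallest per-task value is an assumption made here for illustration, and the function name and input shape are hypothetical.

```python
def normalise_cmetric(per_task):
    """Scale per-task CMetric totals so that the smallest task equals 1.0.

    per_task: dict mapping task ID -> total CMetric (must be positive).
    Normalising by the minimum is an assumption, not GAPP's documented method.
    """
    smallest = min(per_task.values())
    if smallest <= 0:
        raise ValueError("CMetric totals must be positive")
    return {task: value / smallest for task, value in per_task.items()}
```

Under this scaling, a task with a normalised value well above 1 spent far more time running at low parallelism than its peers, which is the signature of load imbalance.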
GAPP Profile - Nektar++
For each critical path:
Critical Path 1:
__GI___poll ()[libc-2.27.so]
<---MPIDI_CH3I_Progress ()[libmpi.so.12.1.1]
<---MPIC_Wait ()[libmpi.so.12.1.1]
<---MPIC_Recv ()[libmpi.so.12.1.1]
<---MPIR_Bcast_binomial ()[libmpi.so.12.1.1]
<---MPIR_Bcast_intra ()[libmpi.so.12.1.1]
<---MPIR_Bcast ()[libmpi.so.12.1.1]
<---MPIR_Bcast_impl ()[libmpi.so.12.1.1]
<---MPIR_Allreduce_intra ()[libmpi.so.12.1.1]
<---MPIR_Allreduce_impl ()[libmpi.so.12.1.1]
Functions and lines + Frequency
dgemv_ () [libblas.so.3.8.0] -- 594
double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 116
gather_double_add () [libMultiRegions-g.so.5.0.0b] -- 58
Combining functions and lines from critical paths:
Top Critical Functions and lines + Frequency
dgemv_ () [libblas.so.3.8.0] -- 781
double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 170
gather_double_add () [libMultiRegions-g.so.5.0.0b] -- 100
Optimizing critical functions – Nektar++
• Bottleneck function (dgemv): matrix-vector multiplication routine exported by the BLAS library.
• Replacing the default BLAS libraries with OpenBLAS improved run time by 27%.
Fig: Sample count per function name before optimization (F1, F2, F3) and after optimization (F2, F4, F1), with the bottleneck function (dgemv) highlighted before optimization
Conclusion
• GAPP was able to identify different types of serialization bottlenecks in different classes of applications.
• Robust – Consistent results across multiple runs under the same test conditions.
• Customizable – Tuneable parameters: N_min, sampling frequency, stack depth, option to include results from dynamic libraries.
• Limitation – Will not work with spin-wait loops that do not block.
• Available at – https://github.com/RN-dev-repo/GAPP
Thank You