Center for Information Services and High Performance Computing (ZIH)

Combining Instrumentation and Sampling for Trace-based Application Performance Analysis

8th International Parallel Tools Workshop, Stuttgart, Germany, October 2, 2014

Thomas Ilsche (thomas.ilsche@tu-dresden.de), Joseph Schuchart, Robert Schöne, Daniel Hackenberg
Introduction

Looking at the landscape of performance analysis tools:
– Identify established techniques
– Provide a structured overview
– Highlight strengths and weaknesses

Identify novel combinations:
– Combine strengths
– Mitigate weaknesses
– Look beyond the traditional fields of tools
Classification of performance analysis techniques

Performance Analysis Layer      Performance Analysis Technique
Data Presentation               Profiles | Timelines
Data Recording                  Summarization | Logging
Data Acquisition                Sampling | Event-based Instrumentation

Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)
Data Acquisition: Event-based Instrumentation

[Figure: timeline of main calling foo and bar, with the measurement environment recording each function entry and exit]

Event-based instrumentation; also: direct instrumentation, event trigger, probe-based measurement, or simply instrumentation. Modification of the application execution in order to record and present certain intrinsic events of the application execution, e.g., function entry and exit events.
Data Acquisition: Event-based Instrumentation

Overhead & perturbation depend on the function call rate
– Hard to predict in complex applications
– Can be influenced by filtering function calls
  • Preferably statically, not during runtime

Complete information
– Accurate function call counts
– Message properties (semantics of function call arguments)
– Analysis tools may rely on completeness
Data Acquisition: Event-based Instrumentation

Various instrumentation methods available:
– Compiler instrumentation *
– Library wrapping **
– Source code transformation *
– Manual instrumentation *
– Binary instrumentation

* Requires recompilation & a separate performance measurement binary
** Requires relinking for statically linked binaries
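Whatever the insertion method, the inserted probes all do the same job: record an event at every function entry and exit. A minimal sketch of manual instrumentation, emulated here with a Python decorator (the names `instrument` and `events` are illustrative, not a real tool API):

```python
# Manual event-based instrumentation sketch: a decorator plays the role
# of the probe, logging timestamped enter/leave events for each call.
import functools
import time

events = []  # the "trace buffer": (timestamp, event kind, function name)

def instrument(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append((time.perf_counter(), "enter", fn.__name__))
        try:
            return fn(*args, **kwargs)
        finally:
            events.append((time.perf_counter(), "leave", fn.__name__))
    return wrapper

@instrument
def bar():
    pass

@instrument
def foo():
    bar()
    bar()

foo()
# Because every call is recorded, accurate call counts fall out of the
# complete event stream -- the key property of instrumentation:
calls = sum(1 for _, kind, name in events if kind == "enter" and name == "bar")
print(calls)  # 2
```

Note that the probe cost is paid on every call, which is why overhead scales with the function call rate.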
Data Acquisition: Sampling

[Figure: timeline of main calling foo and bar, with the measurement environment taking samples at 200us, 400us, 600us, 800us]

Sampling; also: statistical sampling or (ambiguously) profiling. Periodic interruption of a running program and inspection of its state.
Data Acquisition: Sampling

Overhead & perturbation depend on the sampling rate
– Can be predicted
– Can be controlled
– Stack unwinding introduces uncertainty

Easy to use (for end users)
– No recompilation or relinking necessary
– No filtering necessary
Data Acquisition: Sampling

Incomplete information
– No accurate function call counts
– No specific message properties or other semantics of function arguments

Measurement has statistical value
– More reliable for longer running experiments

Trade-off between accuracy and perturbation via sampling rate
Summarization vs Logging

Defines how the recording during runtime is performed.

Event-based Instrumentation:
  Summarization              Logging
  count[main]++              0000us: Enter main
  count[foo]++               0050us: Enter foo
  count[bar]++               0100us: Enter bar
  time[bar] += 200           0300us: Leave bar
  time[foo] += 600           0650us: Leave foo
  count[bar]++               0700us: Enter bar
  time[bar] += 200           0900us: Leave bar
  time[main] += 1000         1000us: Leave main

Sampling:
  Summarization              Logging
  time_ex[bar] += 200        200us: main|foo|bar
  time_ex[foo] += 200        400us: main|foo
  time_ex[foo] += 200        600us: main|foo
  time_ex[bar] += 200        800us: main|bar

Summarization loses information; logging requires memory at runtime.
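The two recording modes can be contrasted on the instrumented event stream above: logging keeps every timestamped event, while summarization folds the same events into per-function counters as they occur. A sketch using the slide's numbers (times in microseconds):

```python
# The full event log (what logging records verbatim).
log = [
    (0,    "enter", "main"),
    (50,   "enter", "foo"),
    (100,  "enter", "bar"),
    (300,  "leave", "bar"),
    (650,  "leave", "foo"),
    (700,  "enter", "bar"),
    (900,  "leave", "bar"),
    (1000, "leave", "main"),
]

# Summarization: fold events into counters on the fly, using a shadow
# call stack to match each leave with its enter.
count, incl_time, stack = {}, {}, []
for ts, kind, name in log:
    if kind == "enter":
        count[name] = count.get(name, 0) + 1
        stack.append((name, ts))
    else:
        name, t_enter = stack.pop()
        incl_time[name] = incl_time.get(name, 0) + (ts - t_enter)

print(count)      # {'main': 1, 'foo': 1, 'bar': 2}
print(incl_time)  # inclusive times: bar 400us, foo 600us, main 1000us
```

The summary needs only a few counters at runtime but cannot reproduce the event order or timestamps; the log can regenerate the summary (as done here) but grows with every event.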
Data Presentation

Example profile (gprof):
  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total
   time   seconds   seconds    calls  ms/call  ms/call  name
  33.34      0.02      0.02     7208     0.00     0.00  open
  16.67      0.03      0.01      244     0.04     0.12  offtime
  16.67      0.04      0.01        8     1.25     1.25  memccpy
  16.67      0.05      0.01        7     1.43     1.43  write
• Can be generated by summarization, but also from logging

Example timeline showing call-path and event annotations (Vampir)
• Needs logging during recording
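Generating a profile from logged data is straightforward: each sample in a call-path log (as in the sampling slides above) charges one sampling period of exclusive "self" time to the function on top of the stack, yielding a gprof-style flat profile. A sketch using hypothetical names (`period_us`, `sample_log`):

```python
# Turn a sampled call-path log into a flat profile of exclusive time.
period_us = 200
sample_log = ["main|foo|bar", "main|foo", "main|foo", "main|bar"]

self_us = {}
for path in sample_log:
    leaf = path.split("|")[-1]  # innermost function gets the self time
    self_us[leaf] = self_us.get(leaf, 0) + period_us

total = sum(self_us.values())
for name, t in sorted(self_us.items(), key=lambda kv: -kv[1]):
    print(f"{100 * t / total:6.2f}%  {t:5d}us  {name}")
```

Note the direction of the conversion: a log can always be summarized into a profile after the fact, but a profile recorded by summarization cannot be expanded back into a timeline.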
Classification of performance analysis techniques

Profiling and tracing as paths through the layers:
– Profiling: summarization → profiles
– Tracing: logging → timelines (or profiles)

Data Presentation:  Profiles | Timelines
Data Recording:     Summarization | Logging
Data Acquisition:   Sampling | Event-based Instrumentation
Example Tools and Concepts

Event-based Instrumentation:  VampirTrace, Scalasca, TAU, Score-P, Extrae
Sampling:                     gprof, perf, HPCToolkit, Allinea MAP

(Each tool covers profiling, tracing, or both.)
Combining Performance Analysis Techniques (1)

INDDGO: C++ graph code, OpenMP, 4 threads
– Uninstrumented: < 6 seconds
– Instrumented (profiling): 72 seconds → 1100% overhead!
– A trace file would be ~3.8 GB, with even more overhead
Combining Performance Analysis Techniques (1)

Estimated aggregate size of event trace:                   3851MB
Estimated requirements for largest trace buffer (max_buf): 3851MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       3860MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=3860MB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

type  max_buf[B]     visits       time[s]  region
ALL   4,038,048,140  161,849,290  119.61   ALL
USR   4,038,047,650  161,849,275  115.72   USR
OMP   412            12           0.07     OMP
COM   78             3            3.82     COM

USR   365,389,440    14,053,440   3.58     Graph::lcgrand(int)
USR   322,737,636    12,412,986   5.78     std::_List_iterator<int>::operator*() const
USR   208,735,202    8,028,277    3.70     std::_List_iterator<int>::operator++()
USR   201,389,266    7,745,741    3.02     std::_List_iterator<int>::_List_iterator …
USR   200,350,128    12,521,883   6.12     std::_List_iterator<int>::operator!=…
…
USR   1,040,000      40,000       0.01     Graph::Node* std::__addressof …

→ 72 functions with > 1 million visits
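The hint about USR region filters could be acted on with a Score-P filter file. A sketch (the excluded patterns are taken from the heavy-hitter regions above; check the Score-P manual for the exact filtering syntax supported by your version):

```
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    std::_List_iterator*
    Graph::lcgrand*
SCOREP_REGION_NAMES_END
```

Such a file is typically activated via the SCOREP_FILTERING_FILE environment variable; it removes the tiny, frequently called USR regions from the measurement, shrinking both overhead and trace size, at the cost of losing those regions from the result.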