
Combining Instrumentation and Sampling for Trace-based Application Performance Analysis



  1. Center for Information Services and High Performance Computing (ZIH)
     Combining Instrumentation and Sampling for Trace-based Application Performance Analysis
     8th International Parallel Tools Workshop, Stuttgart, Germany, October 2, 2014
     Thomas Ilsche (thomas.ilsche@tu-dresden.de), Joseph Schuchart, Robert Schöne, Daniel Hackenberg

  2. Introduction
     Looking at the landscape of performance analysis tools
     – Identify established techniques
     – Provide a structured overview
     – Highlight strengths and weaknesses
     Identify novel combinations
     – Combine strengths
     – Mitigate weaknesses
     – Look beyond the traditional fields of tools

  3. Classification of performance analysis techniques
     [Layer diagram: Data Acquisition (Event-based Instrumentation, Sampling) →
      Data Recording (Summarization, Logging) → Data Presentation (Profiles, Timelines)]
     Based on [10] Juckeland, G.: Trace-based Performance Analysis for Hardware Accelerators. Ph.D. thesis, TU Dresden (2012)

  4. Classification of performance analysis techniques
     [Same layer diagram, repeated to introduce the Data Acquisition layer (slides 5–10)]

  5. Data Acquisition: Event-based Instrumentation
     [Timeline sketch: main → foo → bar → bar, with the measurement environment invoked at each function entry and exit]
     Event-based instrumentation; also: direct instrumentation, event trigger, probe-based measurement, or simply instrumentation.
     Modification of the application execution in order to record and present certain intrinsic events of the application execution, e.g., function entry and exit events.
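     To make this concrete, here is a minimal sketch (not from the talk) of how compiler instrumentation delivers such enter/leave events: building with GCC's -finstrument-functions makes the compiler call two hooks around every function, which a measurement library then turns into profile updates or trace records.

       /* hooks.c: event-based instrumentation sketch (illustrative only).
        * Build:  gcc -finstrument-functions app.c hooks.c -o app
        * The attribute keeps the hooks themselves uninstrumented, which
        * would otherwise cause infinite recursion. */
       #include <stdio.h>

       void __cyg_profile_func_enter(void *fn, void *caller)
           __attribute__((no_instrument_function));
       void __cyg_profile_func_exit(void *fn, void *caller)
           __attribute__((no_instrument_function));

       void __cyg_profile_func_enter(void *fn, void *caller)
       {
           fprintf(stderr, "enter %p (called from %p)\n", fn, caller);  /* "Enter foo" event */
       }

       void __cyg_profile_func_exit(void *fn, void *caller)
       {
           fprintf(stderr, "leave %p (called from %p)\n", fn, caller);  /* "Leave foo" event */
       }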

  6. Data Acquisition: Event-based Instrumentation
     Overhead & perturbation depend on the function call rate
     – Hard to predict in complex applications
     – Can be influenced by filtering function calls
       • Preferably statically, not during runtime
     Complete information
     – Accurate function call counts
     – Message properties (semantics of function call arguments)
     – Analysis tools may rely on completeness

  7. Data Acquisition: Event-based Instrumentation
     Various instrumentation methods available:
     – Compiler instrumentation *
     – Library wrapping ** (sketched below)
     – Source code transformation *
     – Manual instrumentation *
     – Binary instrumentation
     * Requires recompilation & a separate performance measurement binary
     ** Requires relinking for statically linked binaries
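     As a sketch of the library-wrapping variant (assuming MPI and its standard PMPI profiling interface; the logging shown here is illustrative, not the talk's tooling), a wrapper can record both the timing and the message properties mentioned on the previous slide before forwarding to the real implementation:

       /* wrap_send.c: library wrapping via the MPI profiling interface (PMPI).
        * Linking (or preloading) this object intercepts MPI_Send without
        * touching the application source. */
       #include <mpi.h>
       #include <stdio.h>

       int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                    int dest, int tag, MPI_Comm comm)
       {
           double t0 = MPI_Wtime();                         /* enter event */
           int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);
           double t1 = MPI_Wtime();                         /* leave event */
           /* message properties: destination rank, tag, element count */
           fprintf(stderr, "MPI_Send: dest=%d tag=%d count=%d time=%.6fs\n",
                   dest, tag, count, t1 - t0);
           return ret;
       }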

  8. Data Acquisition: Sampling
     [Timeline sketch: samples taken at 200 µs, 400 µs, 600 µs, 800 µs while main, foo, and bar execute]
     Sampling; also: statistical sampling or (ambiguously) profiling.
     Periodic interruption of a running program and inspection of its state.
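     A minimal sketch of the mechanism (illustrative, not from the talk): a POSIX profiling timer delivers SIGPROF at a fixed rate, and the handler inspects the interrupted call stack.

       /* sample.c: statistical sampling sketch (illustrative only).
        * Build:  gcc -g -rdynamic sample.c -o sample
        * Caveat: backtrace() is not guaranteed async-signal-safe;
        * real samplers unwind more carefully. */
       #include <execinfo.h>
       #include <signal.h>
       #include <string.h>
       #include <sys/time.h>

       static void on_sample(int sig)
       {
           void *frames[64];
           int n = backtrace(frames, 64);        /* unwind the interrupted stack */
           backtrace_symbols_fd(frames, n, 2);   /* write one sample to stderr   */
           (void)sig;
       }

       int main(void)
       {
           struct sigaction sa;
           memset(&sa, 0, sizeof sa);
           sa.sa_handler = on_sample;
           sa.sa_flags   = SA_RESTART;
           sigaction(SIGPROF, &sa, NULL);

           /* 100 Hz: SIGPROF fires every 10 ms of consumed CPU time */
           struct itimerval iv = { { 0, 10000 }, { 0, 10000 } };
           setitimer(ITIMER_PROF, &iv, NULL);

           for (volatile long i = 0; i < 500000000L; i++)   /* work to be sampled */
               ;
           return 0;
       }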

  9. Data Acquisition: Sampling
     Overhead & perturbation depend on the sampling rate
     – Can be predicted
     – Can be controlled
     – Stack unwinding introduces uncertainty
     Easy to use (for end users)
     – No recompilation or relinking necessary
     – No filtering necessary
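     In practice the rate is simply a knob the user turns; with Linux perf, for example, something like perf record -F 99 --call-graph dwarf ./app followed by perf report samples at roughly 99 Hz with DWARF-based stack unwinding (exact options depend on the perf version).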

  10. Data Acquisition: Sampling
     Incomplete information
     – No accurate function call counts
     – No specific message properties or other semantics of function arguments
     Measurement has statistical value
     – More reliable for longer running experiments
     Trade-off between accuracy and perturbation via the sampling rate

  11. Classification of performance analysis techniques
     [Same layer diagram, repeated to introduce the Data Recording layer: Summarization vs. Logging (slides 12–16)]

  12.–16. Summarization vs. Logging
     Defines how the recording is performed during runtime. The four combinations of data
     acquisition (event-based instrumentation, sampling) and data recording (summarization,
     logging), for the example run shown earlier:

     Event-based instrumentation:
       Summarization                  Logging
       count[main]++                  0000us: Enter main
       count[foo]++                   0050us: Enter foo
       count[bar]++                   0100us: Enter bar
       time[bar]  += 200              0300us: Leave bar
       time[foo]  += 600              0650us: Leave foo
       count[bar]++                   0700us: Enter bar
       time[bar]  += 200              0900us: Leave bar
       time[main] += 1000             1000us: Leave main

     Sampling:
       Summarization                  Logging
       time_ex[bar] += 200            200us: main|foo|bar
       time_ex[foo] += 200            400us: main|foo
       time_ex[foo] += 200            600us: main|foo
       time_ex[bar] += 200            800us: main|bar

     – Summarization loses information
     – Logging requires memory at runtime
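     A compact sketch of the distinction (hypothetical helper names, not from the talk): the same enter/leave events can either update per-region counters (summarization) or be appended as timestamped records (logging).

       /* record.c: one event stream, two recording modes (illustrative). */
       #include <stdio.h>
       #include <time.h>

       #define MAX_REGIONS 128
       static enum { SUMMARIZE, LOG } mode = SUMMARIZE;

       static long   call_count[MAX_REGIONS];   /* count[foo]                    */
       static double incl_time[MAX_REGIONS];    /* time[foo] (inclusive)         */
       static double enter_ts[MAX_REGIONS];     /* ignores recursion for brevity */

       static double now_us(void)
       {
           struct timespec ts;
           clock_gettime(CLOCK_MONOTONIC, &ts);
           return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
       }

       void on_enter(int region, const char *name)
       {
           if (mode == SUMMARIZE) {              /* profile: fixed memory       */
               call_count[region]++;
               enter_ts[region] = now_us();
           } else {                              /* trace: one record per event */
               printf("%07.0fus: Enter %s\n", now_us(), name);
           }
       }

       void on_leave(int region, const char *name)
       {
           if (mode == SUMMARIZE)
               incl_time[region] += now_us() - enter_ts[region];
           else
               printf("%07.0fus: Leave %s\n", now_us(), name);
       }

     The summarizing mode needs only a fixed amount of memory but discards the event order; the logging mode preserves it at the cost of buffer space, which is exactly the trade-off noted above.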

  17. Classification of performance analysis techniques
     [Same layer diagram, repeated to introduce the Data Presentation layer: Profiles vs. Timelines (slide 18)]

  18. Data Presentation
     Example profile (gprof):
       Each sample counts as 0.01 seconds.
         %   cumulative   self              self     total
        time   seconds   seconds    calls  ms/call  ms/call  name
        33.34      0.02      0.02     7208     0.00     0.00  open
        16.67      0.03      0.01      244     0.04     0.12  offtime
        16.67      0.04      0.01        8     1.25     1.25  memccpy
        16.67      0.05      0.01        7     1.43     1.43  write
     – Can be generated by summarization, but also from logging
     Example timeline showing call paths and event annotations (Vampir):
     [Timeline screenshot]
     – Needs logging during recording
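     For reference: a flat profile like the one above is produced by compiling and linking with -pg, running the program once (which writes gmon.out), and then invoking gprof on the binary and that file. The Vampir timeline, in contrast, is rendered from a logged event trace (e.g., an OTF/OTF2 file).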

  19. Classification of performance analysis techniques
     [Same layer diagram, now with the two classic technique combinations labeled: Profiling and Tracing]

  20. Tools
     [2×2 overview of example tools and concepts: data acquisition (rows) vs. data presentation, Profiling / Tracing (columns)]
     – Event-based instrumentation: VampirTrace, Scalasca, TAU, Score-P, Extrae
     – Sampling: gprof, perf, HPCToolkit, Allinea MAP

  21. Combining Performance Analysis Techniques (1)
     INDDGO: C++ graph code, OpenMP, 4 threads
     – Uninstrumented: < 6 seconds
     – Instrumented (profiling): 72 seconds → 1100% overhead!
     – A trace file would be ~3.8 GB, with even more overhead
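     A rough back-of-the-envelope check, using the visit count reported on the next slide: the instrumented run adds about 72 s − 6 s ≈ 66 s across roughly 162 million region visits, i.e. on the order of 0.4 µs per enter/leave pair, and the projected trace corresponds to about 25 bytes per visit (4,038,048,140 B / 161,849,290 visits ≈ 3.8 GB). With such a high call rate, even very cheap per-event hooks dominate the runtime.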

  22. Combining Performance Analysis Techniques (1)
     scorep-score output for the instrumented run (72 functions with > 1 million visits):

       Estimated aggregate size of event trace:                   3851MB
       Estimated requirements for largest trace buffer (max_buf): 3851MB
       Estimated memory requirements (SCOREP_TOTAL_MEMORY):       3860MB
       (hint: When tracing set SCOREP_TOTAL_MEMORY=3860MB to avoid intermediate flushes
        or reduce requirements using USR regions filters.)

       type     max_buf[B]       visits  time[s]  region
       ALL   4,038,048,140  161,849,290   119.61  ALL
       USR   4,038,047,650  161,849,275   115.72  USR
       OMP             412           12     0.07  OMP
       COM              78            3     3.82  COM
       USR     365,389,440   14,053,440     3.58  Graph::lcgrand(int)
       USR     322,737,636   12,412,986     5.78  std::_List_iterator<int>::operator*() const
       USR     208,735,202    8,028,277     3.70  std::_List_iterator<int>::operator++()
       USR     201,389,266    7,745,741     3.02  std::_List_iterator<int>::_List_iterator …
       USR     200,350,128   12,521,883     6.12  std::_List_iterator<int>::operator!= …
       …
       USR       1,040,000       40,000     0.01  Graph::Node* std::__addressof …
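     The hint in the scorep-score output points to the usual mitigation: a filter that excludes the tiny, frequently visited helper regions. A minimal sketch of such a Score-P filter file (region names taken from the listing above; the exact selection is illustrative, not from the talk):

       # graph.filter: exclude short, frequently called C++ helper regions
       SCOREP_REGION_NAMES_BEGIN
         EXCLUDE
           std::_List_iterator*
           Graph::lcgrand*
       SCOREP_REGION_NAMES_END

     Applied via the SCOREP_FILTERING_FILE environment variable (its effect can be previewed with scorep-score -f), such a filter removes most of the 162 million USR visits and shrinks the estimated trace accordingly, at the price of losing exactly those events.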
