
Introduction to Parallel Performance Engineering - Bert Wesarg



  1. VIRTUAL INSTITUTE – HIGH PRODUCTIVITY SUPERCOMPUTING
     Introduction to Parallel Performance Engineering
     Bert Wesarg, Technische Universität Dresden
     (with content used with permission from tutorials by Bernd Mohr/JSC and Luiz DeRose/Cray)

  2. Performance: an old problem
     [Image: Difference Engine]
     “The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible.”
     Charles Babbage (1791 – 1871)

  3. Today: the “free lunch” is over
     ■ Moore's law is still in charge, but
       ■ Clock rates no longer increase
       ■ Performance gains only through increased parallelism
     ■ Optimization of applications more difficult
       ■ Increasing application complexity
         ■ Multi-physics
         ■ Multi-scale
       ■ Increasing machine complexity
         ■ Hierarchical networks / memory
         ■ More CPUs / multi-core
     → Every doubling of scale reveals a new bottleneck!

  4. Performance factors of parallel applications
     ■ “Sequential” performance factors
       ■ Computation
         → Choose the right algorithm, use an optimizing compiler
       ■ Cache and memory
         → Tough! Only limited tool support, hope the compiler gets it right
       ■ Input / output
         → Often not given enough attention
     ■ “Parallel” performance factors
       ■ Partitioning / decomposition
       ■ Communication (i.e., message passing)
       ■ Multithreading
       ■ Synchronization / locking
         → More or less understood, good tool support

  5. Tuning basics
     ■ Successful engineering is a combination of
       ■ Careful setting of various tuning parameters
       ■ The right algorithms and libraries
       ■ Compiler flags and directives
       ■ …
       ■ Thinking!!!
     ■ Measurement is better than guessing
       ■ To determine performance bottlenecks
       ■ To compare alternatives
       ■ To validate tuning decisions and optimizations
         → After each step!

  6. Performance engineering workflow
     ■ Preparation
       • Prepare application with symbols
       • Insert extra code (probes/hooks)
     ■ Measurement
       • Collection of performance data
       • Aggregation of performance data
     ■ Analysis
       • Calculation of metrics
       • Identification of performance problems
       • Presentation of results
     ■ Optimization
       • Modifications intended to eliminate/reduce performance problems

  7. The 80/20 rule
     ■ Programs typically spend 80% of their time in 20% of the code
     ■ Programmers typically spend 20% of their effort to get 80% of the total speedup possible for the application
       → Know when to stop!
     ■ Don't optimize what does not matter
       → Make the common case fast!
     “If you optimize everything, you will always be unhappy.”
     Donald E. Knuth

  8. Metrics of performance
     ■ What can be measured?
       ■ A count of how often an event occurs
         ■ E.g., the number of MPI point-to-point messages sent
       ■ The duration of some interval
         ■ E.g., the time spent in these send calls
       ■ The size of some parameter
         ■ E.g., the number of bytes transmitted by these calls
     ■ Derived metrics
       ■ E.g., rates / throughput
       ■ Needed for normalization
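
     The derived-metric idea on this slide boils down to simple arithmetic: a rate is a count or a size divided by a duration. A minimal C sketch, with made-up numbers standing in for real measurements:

     #include <stdio.h>

     /* Minimal sketch: deriving rates from raw measurements.
      * The values below are made-up placeholders, not real data. */
     int main(void)
     {
         long   messages_sent = 12000;       /* count   : MPI point-to-point sends   */
         double send_time_s   = 3.5;         /* duration: time spent in those sends  */
         long   bytes_sent    = 48000000;    /* size    : bytes transmitted          */

         /* Derived metrics: normalizing by time makes runs of
          * different length comparable. */
         double msg_rate   = messages_sent / send_time_s;          /* messages per second */
         double throughput = bytes_sent / send_time_s / 1e6;       /* MB per second       */
         double avg_size   = (double)bytes_sent / messages_sent;   /* bytes per message   */

         printf("%.0f msg/s, %.1f MB/s, %.0f B/msg\n", msg_rate, throughput, avg_size);
         return 0;
     }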

  9. Example metrics
     ■ Execution time
     ■ Number of function calls
     ■ CPI
       ■ CPU cycles per instruction
     ■ FLOPS
       ■ Floating-point operations executed per second
       ■ “Math” operations? HW operations? HW instructions? 32-/64-bit? …
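
     CPI and FLOPS are themselves derived metrics, computed from hardware-counter readings and elapsed time. A minimal sketch with hypothetical counter values; real tools read these from hardware performance counters (e.g. via PAPI):

     #include <stdio.h>

     /* Minimal sketch: CPI and FLOPS as derived metrics.
      * The counter values are hypothetical placeholders. */
     int main(void)
     {
         double cycles        = 8.0e9;   /* CPU cycles consumed       */
         double instructions  = 4.0e9;   /* instructions retired      */
         double fp_operations = 1.5e9;   /* floating-point operations */
         double elapsed_s     = 2.0;     /* wall-clock time (seconds) */

         double cpi    = cycles / instructions;           /* cycles per instruction */
         double gflops = fp_operations / elapsed_s / 1e9;

         printf("CPI = %.2f, %.2f GFLOPS\n", cpi, gflops);
         return 0;
     }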

  10. Execution time
     ■ Wall-clock time
       ■ Includes waiting time: I/O, memory, other system activities
       ■ In time-sharing environments also the time consumed by other applications
     ■ CPU time
       ■ Time spent by the CPU to execute the application
       ■ Does not include time the program was context-switched out
         ■ Problem: Does not include inherent waiting time (e.g., I/O)
         ■ Problem: Portability? What is user, what is system time?
     ■ Problem: Execution time is non-deterministic
       ■ Use the median of several runs, or at least the arithmetic mean
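
     The wall-clock vs. CPU time distinction can be seen directly with POSIX timers: CLOCK_MONOTONIC measures wall-clock time, CLOCK_PROCESS_CPUTIME_ID measures the process's CPU time. A minimal sketch, assuming a POSIX system; the sleep() stands in for inherent waiting time such as I/O:

     #include <stdio.h>
     #include <time.h>
     #include <unistd.h>

     /* Minimal sketch: wall-clock vs. CPU time for the same code region. */
     int main(void)
     {
         struct timespec wall0, wall1, cpu0, cpu1;

         clock_gettime(CLOCK_MONOTONIC, &wall0);          /* wall clock       */
         clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);  /* process CPU time */

         volatile double x = 0.0;
         for (long i = 0; i < 50000000L; i++)             /* some computation */
             x += 1.0 / (double)(i + 1);
         sleep(1);                                        /* waiting time (e.g., I/O) */

         clock_gettime(CLOCK_MONOTONIC, &wall1);
         clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

         double wall = (wall1.tv_sec - wall0.tv_sec) + (wall1.tv_nsec - wall0.tv_nsec) / 1e9;
         double cpu  = (cpu1.tv_sec  - cpu0.tv_sec)  + (cpu1.tv_nsec  - cpu0.tv_nsec)  / 1e9;

         printf("wall-clock: %.3f s, CPU: %.3f s\n", wall, cpu);
         return 0;
     }

     The wall-clock result includes the one second of sleeping; the CPU-time result does not, which is exactly the difference the slide describes.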

  11. Inclusive vs. exclusive values
     ■ Inclusive
       ■ Information of all sub-elements aggregated into a single value
     ■ Exclusive
       ■ Information cannot be subdivided further

     int foo()
     {
         int a;
         a = 1 + 1;     // exclusive
         bar();
         a = a + 1;     // exclusive
         return a;
     }                  // inclusive: the whole body of foo(), including bar()
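
     In other words, the exclusive value of foo() is its inclusive value minus the inclusive values of its callees, here bar(). A minimal C sketch of that relation; the timer helper and the busy loops are illustrative assumptions, not how a measurement tool actually attributes time:

     #include <stdio.h>
     #include <time.h>

     static double now(void)
     {
         struct timespec ts;
         clock_gettime(CLOCK_MONOTONIC, &ts);
         return ts.tv_sec + ts.tv_nsec / 1e9;
     }

     static double bar_incl;   /* accumulated inclusive time of bar() */

     static void bar(void)
     {
         double t0 = now();
         for (volatile long i = 0; i < 20000000L; i++) ;  /* work inside bar */
         bar_incl += now() - t0;
     }

     static int foo(void)
     {
         int a = 1 + 1;       /* work exclusive to foo */
         bar();               /* counted in foo's inclusive value only */
         a = a + 1;
         return a;
     }

     int main(void)
     {
         double t0 = now();
         foo();
         double foo_incl = now() - t0;
         double foo_excl = foo_incl - bar_incl;   /* inclusive minus callees */

         printf("foo inclusive: %.3f s, exclusive: %.3f s, bar inclusive: %.3f s\n",
                foo_incl, foo_excl, bar_incl);
         return 0;
     }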

  12. Classification of measurement techniques
     ■ How are performance measurements triggered?
       ■ Sampling
       ■ Code instrumentation
     ■ How is performance data recorded?
       ■ Profiling / runtime summarization
       ■ Tracing / logging
     ■ How is performance data analyzed?
       ■ Post mortem
       ■ Online

  13. Sampling
     [Timeline: samples taken at t1 … t9 while main, foo(0), foo(1), foo(2) execute]

     int main()
     {
         int i;
         for (i = 0; i < 3; i++)
             foo(i);
         return 0;
     }

     void foo(int i)
     {
         if (i > 0)
             foo(i - 1);
     }

     → Running program is periodically interrupted to take a measurement
       → Timer interrupt, OS signal, or HWC overflow
       → Service routine examines the return-address stack
       → Addresses are mapped to routines using symbol table information
     → Statistical inference of program behavior
       → Not very detailed information on highly volatile metrics
       → Requires long-running applications
     → Works with unmodified executables
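
     A minimal sketch of the sampling mechanism on a POSIX system: setitimer(ITIMER_PROF) delivers SIGPROF periodically, and a signal handler takes the measurement. A real profiler would walk the return-address stack and map addresses to routines via the symbol table; this sketch only keeps a crude in_foo flag and counts hits:

     #include <signal.h>
     #include <stdio.h>
     #include <sys/time.h>

     static volatile sig_atomic_t samples_in_foo;
     static volatile sig_atomic_t in_foo;          /* crude flag instead of a stack walk */

     static void on_sample(int sig)
     {
         (void)sig;
         if (in_foo)
             samples_in_foo++;
     }

     static void foo(int i)
     {
         in_foo = 1;
         for (volatile long k = 0; k < 50000000L; k++) ;  /* work to be sampled */
         if (i > 0)
             foo(i - 1);
         in_foo = 0;
     }

     int main(void)
     {
         struct sigaction sa = { .sa_handler = on_sample };
         sigemptyset(&sa.sa_mask);
         sigaction(SIGPROF, &sa, NULL);

         /* fire SIGPROF every 10 ms of consumed CPU time */
         struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
         setitimer(ITIMER_PROF, &it, NULL);

         for (int i = 0; i < 3; i++)
             foo(i);

         printf("samples attributed to foo: %d\n", (int)samples_in_foo);
         return 0;
     }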

  14. Instrumentation
     [Timeline: enter/leave events recorded at t1 … t14 for main, foo(0), foo(1), foo(2)]

     int main()
     {
         int i;
         Enter("main");
         for (i = 0; i < 3; i++)
             foo(i);
         Leave("main");
         return 0;
     }

     void foo(int i)
     {
         Enter("foo");
         if (i > 0)
             foo(i - 1);
         Leave("foo");
     }

     → Measurement code is inserted such that every event of interest is captured directly
       → Can be done in various ways
     → Advantage:
       → Much more detailed information
     → Disadvantage:
       → Processing of source code / executable necessary
       → Large relative overheads for small functions
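
     A minimal sketch of what the inserted probes might look like when added by hand: Enter()/Leave() here are illustrative stand-ins that just log a timestamped event, not the API of any particular tool:

     #include <stdio.h>
     #include <time.h>

     static double now(void)
     {
         struct timespec ts;
         clock_gettime(CLOCK_MONOTONIC, &ts);
         return ts.tv_sec + ts.tv_nsec / 1e9;
     }

     /* probes: record every enter/leave event with a timestamp */
     static void Enter(const char *region) { printf("%.6f ENTER %s\n", now(), region); }
     static void Leave(const char *region) { printf("%.6f LEAVE %s\n", now(), region); }

     static void foo(int i)
     {
         Enter("foo");
         if (i > 0)
             foo(i - 1);
         Leave("foo");
     }

     int main(void)
     {
         Enter("main");
         for (int i = 0; i < 3; i++)
             foo(i);
         Leave("main");
         return 0;
     }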

  15. Instrumentation techniques
     ■ Static instrumentation
       ■ Program is instrumented prior to execution
     ■ Dynamic instrumentation
       ■ Program is instrumented at runtime
     ■ Code is inserted
       ■ Manually
       ■ Automatically
         ■ By a preprocessor / source-to-source translation tool
         ■ By a compiler
         ■ By linking against a pre-instrumented library / runtime system
         ■ By a binary-rewrite / dynamic instrumentation tool
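
     As a concrete example of compiler-inserted instrumentation: GCC and Clang support -finstrument-functions, which makes the compiler call two user-provided hooks at every function entry and exit. A minimal, self-contained sketch of those hooks; the printed addresses can later be mapped to names, e.g. with addr2line:

     /* Compile with:  gcc -finstrument-functions instr.c -o instr
      * The hooks themselves must not be instrumented, hence the attribute. */
     #include <stdio.h>

     __attribute__((no_instrument_function))
     void __cyg_profile_func_enter(void *func, void *call_site)
     {
         fprintf(stderr, "enter %p (called from %p)\n", func, call_site);
     }

     __attribute__((no_instrument_function))
     void __cyg_profile_func_exit(void *func, void *call_site)
     {
         fprintf(stderr, "exit  %p (called from %p)\n", func, call_site);
     }

     void foo(int i)                 /* instrumented automatically by the compiler */
     {
         if (i > 0)
             foo(i - 1);
     }

     int main(void)                  /* instrumented automatically by the compiler */
     {
         for (int i = 0; i < 3; i++)
             foo(i);
         return 0;
     }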

  16. Critical issues
     ■ Accuracy
       ■ Intrusion overhead
         ■ Measurement itself needs time and thus lowers performance
       ■ Perturbation
         ■ Measurement alters program behaviour
         ■ E.g., memory access pattern
       ■ Accuracy of timers & counters
     ■ Granularity
       ■ How many measurements?
       ■ How much information / processing during each measurement?
     → Tradeoff: Accuracy vs. expressiveness of data
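
     The intrusion overhead can be estimated empirically by timing the measurement primitive itself, for example a large number of back-to-back timer calls. A minimal sketch:

     #include <stdio.h>
     #include <time.h>

     /* Minimal sketch: the per-call cost of the timer bounds how often
      * a tool can afford to take a measurement. */
     int main(void)
     {
         enum { N = 1000000 };
         struct timespec t0, t1, dummy;

         clock_gettime(CLOCK_MONOTONIC, &t0);
         for (int i = 0; i < N; i++)
             clock_gettime(CLOCK_MONOTONIC, &dummy);   /* the "measurement" being measured */
         clock_gettime(CLOCK_MONOTONIC, &t1);

         double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
         printf("approx. %.1f ns per timer call\n", total_ns / N);
         return 0;
     }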

  17. Classification of measurement techniques
     ■ How are performance measurements triggered?
       ■ Sampling
       ■ Code instrumentation
     ■ How is performance data recorded?
       ■ Profiling / runtime summarization
       ■ Tracing / logging
     ■ How is performance data analyzed?
       ■ Post mortem
       ■ Online

  18. Profiling / runtime summarization
     ■ Recording of aggregated information
       ■ Total, maximum, minimum, …
     ■ For measurements
       ■ Time
       ■ Counts
         ■ Function calls
         ■ Bytes transferred
         ■ Hardware counters
     ■ Over program and system entities
       ■ Functions, call sites, basic blocks, loops, …
       ■ Processes, threads
     → Profile = summarization of events over the whole execution interval
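
     A minimal sketch of what runtime summarization keeps per region: a call count and total/min/max time, updated on each exit instead of logging every event. The profile_t record and probes are illustrative, not a real tool's data structures:

     #include <stdio.h>
     #include <time.h>

     typedef struct {
         const char *name;
         long        calls;
         double      total, min, max;   /* seconds */
     } profile_t;

     static profile_t foo_profile = { "foo", 0, 0.0, 1e30, 0.0 };

     static double now(void)
     {
         struct timespec ts;
         clock_gettime(CLOCK_MONOTONIC, &ts);
         return ts.tv_sec + ts.tv_nsec / 1e9;
     }

     static void record(profile_t *p, double duration)
     {
         p->calls++;
         p->total += duration;
         if (duration < p->min) p->min = duration;
         if (duration > p->max) p->max = duration;
     }

     static void foo(void)
     {
         double t0 = now();
         for (volatile long i = 0; i < 10000000L; i++) ;   /* the measured work */
         record(&foo_profile, now() - t0);
     }

     int main(void)
     {
         for (int i = 0; i < 5; i++)
             foo();

         /* the profile: one summary line per region for the whole run */
         printf("%s: %ld calls, total %.3f s, min %.4f s, max %.4f s\n",
                foo_profile.name, foo_profile.calls, foo_profile.total,
                foo_profile.min, foo_profile.max);
         return 0;
     }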
