PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications

Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé. April 29, 2014.


  1. PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications. Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé. April 29, 2014. Yanhua Sun 1/24

  2. Motivation
     Complexity: modern parallel computer systems are becoming extremely complex due to complicated network topologies, hierarchical storage systems, heterogeneous processing units, etc.
     Obtaining the best performance is challenging.
     Applications and the runtime should be reconfigurable to adapt to various situations.
     The goal of the control system is to adjust the configuration automatically based on application-specific knowledge and runtime observations.

  3. Outline
     Overview of the PICS framework
     Control points in the runtime system and applications
     Automatic performance analysis to speed up tuning
     APIs implemented in Charm++
     Results of benchmarks and applications

  4. Overview of PICS framework
     Diagram with the following labeled components:
     Real-world applications, mini apps
     Application control points, application reconfiguration
     PICS: controller, automatic performance analysis, expert knowledge rules, performance instrumentation, performance data
     Runtime control points, runtime reconfiguration
     Adaptive runtime system

  5. Control Points
     Control points are tunable parameters through which the application and the runtime interact with the control system. First proposed in Dooley's research.
     (1) Name; values: default, min, max
     (2) Movement unit: +1, ×2
     (3) Effects, directions
     Effects: degree of parallelism, grain size, priority, memory usage, GPU load, message size, number of messages, other effects
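     The movement units on this slide (+1, ×2) suggest that a control point is stepped either additively or multiplicatively inside its [min, max] range. The following is a minimal sketch of that idea, not the PICS implementation: `MoveOp`, `TunableValue`, and `moveOneStep` are hypothetical names, and the clamping behavior is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical illustration types; none of these names come from PICS.
enum class MoveOp { Add, Multiply };  // "+1" style vs "x2" style movement

struct TunableValue {
    double current;   // value in use now
    double minValue;  // lower bound of the legal range
    double maxValue;  // upper bound of the legal range
    double moveUnit;  // step size (Add) or factor (Multiply, assumed > 0)
    MoveOp op;
};

// Returns the value after one move. direction is +1 (increase) or -1
// (decrease). The result is clamped to [minValue, maxValue] (assumption).
double moveOneStep(const TunableValue& v, int direction) {
    assert(direction == 1 || direction == -1);
    double next = v.current;
    if (v.op == MoveOp::Add) {
        next += direction * v.moveUnit;
    } else {
        next = (direction > 0) ? next * v.moveUnit : next / v.moveUnit;
    }
    return std::min(v.maxValue, std::max(v.minValue, next));
}
```

     For example, a sub-block size of 8 with a ×2 movement unit would move to 16 when increased and to 4 when decreased, as long as both stay within the registered bounds.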

  6. Application Control Points
     (1) Application-specific control points are provided by users
     (2) Applications should be able to reconfigure to use new values
     Control points | Effects | Use cases
     sub-block size | parallelism, grain size | Jacobi, Wave, stencil code
     parallel threshold | parallelism, overhead, grain size | state space search
     stages in pipeline | number of messages, message size | pipeline collectives
     algorithm selection | degree of parallelism, grain size | 3D FFT decomposition (slab or pencil)
     software cache size | memory usage, amount of communication | ChaNGa
     ratio of GPU/CPU load | computation, load balance | NAMD, ChaNGa

  7. Runtime System Control Points
     (1) Traditionally, configurations for the runtime system do not change
     (2) Configurations for the runtime system itself should be tunable:
         (a) Registered by the runtime itself
         (b) Require no change from applications
         (c) Affect all applications

  8. Runtime System Control Points
     Control points | Effects | Use cases
     broadcast algorithm selection | communication | most applications
     broadcast/reduction branch factor | critical path | most applications (NAMD)
     compression algorithm | communication, overhead | NAMD, ChaNGa
     fault tolerance frequency | overhead, memory usage | most applications
     load balancing frequency | overhead, load balance | most applications
     tracing data disk write frequency | memory usage, overhead | most applications
     number of AMPI virtual threads | grain size | AMPI applications

  9. Observe Program Behaviors
     Record all events:
     Events: begin idle, end idle
     Functions: name, begin execution, end execution
     Communication: message creation, size, source/destination
     Hardware counters
     Module link, no source code modification
     Performance summary data
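     To illustrate how recorded begin-idle/end-idle events could be reduced to performance summary data, here is a minimal sketch that computes the idle fraction of a time window. The `Event` layout and `idleFraction` are hypothetical and do not reflect the actual Charm++ tracing format; events are assumed to be sorted by time and to fall inside the window.

```cpp
#include <cassert>
#include <vector>

// Hypothetical event record; the real trace format is not shown in the slides.
enum class EventKind { BeginIdle, EndIdle };

struct Event {
    EventKind kind;
    double time;  // timestamp in seconds
};

// Fraction of [start, end] that the processor spent idle.
// Assumes events are time-ordered and lie inside the window.
double idleFraction(const std::vector<Event>& events, double start, double end) {
    assert(end > start);
    double idle = 0.0;
    double idleStart = 0.0;
    bool inIdle = false;
    for (const Event& e : events) {
        if (e.kind == EventKind::BeginIdle) {
            inIdle = true;
            idleStart = e.time;
        } else if (inIdle) {
            idle += e.time - idleStart;  // close the open idle interval
            inIdle = false;
        }
    }
    if (inIdle) idle += end - idleStart;  // idle period still open at window end
    return idle / (end - start);
}
```

     A summary value such as this one is the kind of input a later analysis step could compare against a threshold.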

  10. Automatically Analyze the Performance
      Many control points are registered. How can the search space be reduced?
      Performance analysis: identify program problems in
      Decomposition
      Mapping
      Scheduling

  11. Decomposition Characteristics
      Decomposition problem?
      Symptoms: bytes per message low; (1) too big entry method; (2) too big single object; (3) too much critical path; (4) too few objects per PE; too much communication on one object; high cache miss rate
      Suggested actions: increase grain size; decrease grain size; replicate the objects

  12. Mapping Characteristics
      Mapping problem?
      Symptoms: load imbalance; communication time >> LogP model time; too much communication on one PE; too much external communication
      Suggested actions: load balancer; topology-aware mapping; remap

  13. Scheduling Characteristics
      Scheduling problem?
      Critical tasks are delayed: prioritize the tasks

  14. Other Characteristics
      Other problems?
      Symptoms: bytes per message low; reduction/broadcast; long latency
      Suggested actions: aggregate messages; compress message; collectives

  15. Correlate Performance with Control Points
      Decision tree (figure). Root: performance summary, with branches CPU utilization > 90%, overhead > 10%, idle > 10%.
      Questions: sequential performance? decomposition problem? mapping problem? scheduling problem? others?
      Symptoms: small bytes per message; small entry methods; long entry method; large single object; long critical path; few objects per PE; large communication on one object; cache miss > 10%; load imbalance; communication time >> model time; large external communication; large communication on one PE; critical tasks are delayed; large bytes per message; long reduction/broadcast; long latency
      Solutions: increase grain size; decrease grain size; increase aggregation threshold; decrease aggregation threshold; replicate objects; load balancer; topology-aware mapping; remap; prioritize the tasks; collectives; compress message (for big messages)
      One box can have multiple children. One egg can have multiple parents.
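      The first level of this tree can be sketched as a few threshold checks on the performance summary. This is only an illustration: `PerfSummary` and `topLevelBranches` are hypothetical names, and the grouping of questions under each threshold is one reading of the flattened figure, not a confirmed layout.

```cpp
#include <string>
#include <vector>

// Hypothetical summary record; fields are fractions in [0, 1].
struct PerfSummary {
    double cpuUtilization;
    double overhead;
    double idle;
};

// Returns the questions to investigate next, using the 90% / 10% / 10%
// thresholds from the slide. The branch grouping is an assumption.
std::vector<std::string> topLevelBranches(const PerfSummary& s) {
    std::vector<std::string> out;
    if (s.cpuUtilization > 0.90) {
        out.push_back("sequential performance");
    }
    if (s.overhead > 0.10) {
        out.push_back("small bytes per message");
        out.push_back("small entry methods");
    }
    if (s.idle > 0.10) {
        out.push_back("decomposition problem");
        out.push_back("mapping problem");
        out.push_back("scheduling problem");
        out.push_back("others");
    }
    return out;
}
```

      Each returned question would then be refined by the symptom checks lower in the tree until a solution is reached.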

  16. Correlate Performance with Control Points
      Traverse the tree using the performance summary results:
      performance results ⇒ solutions
      solution ⇒ effect of control points
      Which control points to tune, and in which direction!
      How much? Grain size: MaxObjLoad / AvgLoad
      Feed results into the control points database
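      The slide gives MaxObjLoad / AvgLoad as the quantity that answers "how much" for grain size. A minimal sketch of computing that ratio from per-object loads follows; `loadRatio` is a hypothetical helper, and how PICS applies the ratio to a control point is not stated in the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Ratio of the heaviest object's load to the average object load.
// A value near 1 means the objects are balanced; larger values mean one
// object dominates. Assumes non-empty input with positive total load.
double loadRatio(const std::vector<double>& objLoads) {
    assert(!objLoads.empty());
    const double maxLoad = *std::max_element(objLoads.begin(), objLoads.end());
    const double avgLoad =
        std::accumulate(objLoads.begin(), objLoads.end(), 0.0) / objLoads.size();
    assert(avgLoad > 0.0);
    return maxLoad / avgLoad;
}
```

      With loads {1, 1, 1, 5} the maximum is 5 and the average is 2, so the ratio is 2.5.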

  17. Control System APIs
      typedef struct ControlPoint_t {
          char name[30];
          enum TP_DATATYPE datatype;
          double defaultValue;
          double currentValue;
          double minValue;
          double maxValue;
          double bestValue;
          double moveUnit;
          int moveOP;
          int effect;
          int effectDirection;
          int strategy;
          int entryEP;
          int objectID;
      } ControlPoint;

  18. APIs for applications
      void registerControlPoint(ControlPoint *tp);
      void startStep();
      void endStep();
      void startPhase(int phaseId);
      void endPhase();
      double getTunedParameter(const char *name, bool *valid);
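      A possible calling pattern for these functions is sketched below. The function bodies are mock stand-ins written only so the example is self-contained; they are not the Charm++ implementation. The reduced `ControlPoint` struct, the control point name "subBlockSize", and the reading that `startStep`/`endStep` bracket one iteration are all assumptions.

```cpp
#include <cstring>
#include <map>
#include <string>

// Reduced copy of the slide's struct: only the fields this sketch uses.
struct ControlPoint {
    char name[30];
    double defaultValue;
    double currentValue;
    double minValue;
    double maxValue;
};

// ---- Mock stand-ins (assumptions), NOT the real PICS runtime ----
static std::map<std::string, ControlPoint> g_points;  // registered points
static int g_steps = 0;                               // completed steps

void registerControlPoint(ControlPoint* tp) { g_points[tp->name] = *tp; }
void startStep() {}
void endStep() { ++g_steps; }
double getTunedParameter(const char* name, bool* valid) {
    auto it = g_points.find(name);
    *valid = (it != g_points.end());
    return *valid ? it->second.currentValue : 0.0;
}

// ---- Application side: register once, then poll every step ----
double runSteps(int numSteps) {
    ControlPoint cp{};  // zero-initialized, so name stays NUL-terminated
    std::strncpy(cp.name, "subBlockSize", sizeof(cp.name) - 1);
    cp.defaultValue = 64.0;
    cp.currentValue = 64.0;
    cp.minValue = 8.0;
    cp.maxValue = 256.0;
    registerControlPoint(&cp);

    double blockSize = cp.defaultValue;
    for (int i = 0; i < numSteps; ++i) {
        startStep();
        bool valid = false;
        const double tuned = getTunedParameter("subBlockSize", &valid);
        if (valid) blockSize = tuned;  // reconfigure to the new value
        // ... the application's work for this step would use blockSize ...
        endStep();
    }
    return blockSize;
}
```

      The `valid` flag lets the application keep its current value when no tuned value is available for the requested name.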

  19. Experimental Results of Benchmarks and Applications
      (1) Control points
      (2) Performance problem
      (3) Blue Gene/Q machine, Cray XE6 machine

  20. Tuning Message Pipeline
      Control point: number of pipeline messages
      Figure: Tuning the number of pipeline messages (x-axis: step; y-axes: timestep in ms/step and number of pipeline messages; series: timestep (less work), timestep (more work), pipeline (less work), pipeline (more work))

  21. Message Compression
      Control points: compression algorithm for each message type
      Runtime control points
      Figure: Steering the compression algorithm for all-to-all benchmark (x-axis: step; y-axis: timestep in ms/step; legend: r1=0.1, r2=1.0)

  22. Jacobi3d Performance Steering
      Control points: sub-block size in each dimension (three control points)
      Cache miss rate and high idle time suggest decreasing the sub-block size
      Overhead
      Figure: Jacobi3d performance steering on 64 cores for problem of 1024*1024*1024 (x-axis: step; y-axis: timestep in ms/step; series: total time, idle time, cpu time, runtime overhead)

  23. Communication Bottleneck in ChaNGa
      Control points: number of mirrors
      Ratio of maximum communication per object to average
      Figure: Time cost of calculating gravity for various mirrors and no mirror on 16k cores on Blue Gene/Q (x-axis: steps; y-axis: s/step; series: tune mirrors with PICS, no mirrors)

  24. Conclusion
      Automatic performance tuning is required to improve productivity and performance.
      Automatic performance analysis helps guide performance steering.
      Steering both the runtime system and applications is important.
