PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications
Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
sun51@illinois.edu
June 10, 2014
Yanhua Sun Parallel Programming Laboratory, UIUC 1/25

Motivation
1. Modern parallel computer systems are becoming extremely complex due to network topologies, hierarchical storage systems, heterogeneous processing units, etc.
2. Obtaining the best performance is challenging.
3. Moreover, the same application can be run in multiple configurations.
[Figure: time of using different numbers of messages to send data. x-axis: number of messages (1 to 64); y-axis: time (us); series labels: 512, 1M (f:0.03125)]

Introspection and Adaptivity
General Observation: Configurations of tunable parameters in the runtime system and in applications significantly affect performance.
Top Ten Exascale Research Challenges (DOE report): Introspection and automatic adaptation is listed as a significant research topic for achieving the performance goals of exascale computers.
Statement: This work addresses the problem of how to improve both parallel programming productivity and performance by letting applications and the runtime expose tunable parameters and letting the control system determine the optimal configuration of these parameters.

Related work
- Autotuning frameworks: generate multiple implementations (FFTW)
- Autopilot [Ribler et al. (1998)]: fuzzy logic rules, grid applications, resource management
- MATE [Morajko 2006]: fully automatic tuning, performance model
- Active Harmony [Chung and Hollingsworth (2006)]: heuristic algorithms
- SEEC: A General and Extensible Framework for Self-Aware Computing [Henry Hoffmann (2010, 2011, 2013)]

Our Approach
- Targets HPC applications at large scale
- Does not rely on performance models
- Richer set of tunable parameters, thanks to the powerful intelligent runtime system
- Not only application configurations are tuned, but also the runtime system itself
- Automatic performance analysis accelerates steering

Outline
- Overview of PICS framework
- Control points in the runtime system and applications
- Automatic performance analysis to accelerate steering
- APIs implemented in Charm++
- Results of benchmarks and applications

Overview of PICS framework
[Diagram: PICS framework. Components shown: real-world applications and mini apps; application control points and application reconfiguration; the PICS controller with automatic performance analysis and expert knowledge rules; performance instrumentation and performance data; runtime control points and runtime reconfiguration; the adaptive runtime system.]

Control Points
Control points are tunable parameters through which the application and the runtime interact with the control system. They were first proposed in Dooley's research.
1. Name and values: default, min, max
2. Movement unit: +1, ×2
3. Effects and directions
   - Degree of parallelism
   - Grain size
   - Priority
   - Memory usage
   - GPU load
   - Message size
   - Number of messages
   - Other effects

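The value range and movement unit listed above can be illustrated with a minimal sketch. It assumes that a controller moves a control point one step at a time, either additively (+1) or multiplicatively (×2), and clamps the result to [min, max]. The names `MoveOp`, `Direction`, and `nextValue` are hypothetical and do not come from PICS.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical names for illustration only; not part of the PICS API.
enum class MoveOp { Add, Multiply };
enum class Direction { Decrease, Increase };

// Move `current` by one movement unit in direction `dir`, then clamp the
// result to the allowed range [minValue, maxValue].
double nextValue(double current, double unit, MoveOp op, Direction dir,
                 double minValue, double maxValue) {
    assert(minValue <= maxValue);
    assert(unit > 0.0);
    double candidate = current;
    if (op == MoveOp::Add) {
        candidate = (dir == Direction::Increase) ? current + unit
                                                 : current - unit;
    } else {
        candidate = (dir == Direction::Increase) ? current * unit
                                                 : current / unit;
    }
    return std::max(minValue, std::min(candidate, maxValue));
}
```

For a control point with range [1, 64] and a ×2 movement unit, stepping upward from 4 gives 8, and stepping upward from 64 stays at 64 because of the clamp.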
Application and Runtime Control Points
Application
1. Application-specific control points are provided by users
2. Applications should be able to reconfigure to use new values
Runtime
1. Traditionally, configurations for the runtime system do not change
2. Configurations for the runtime system itself should be tunable
   1. Registered by the runtime itself
   2. Require no change from applications
   3. Affect all applications

Observe Program Behaviors
Record all events
- Events: begin idle, end idle
- Functions: name, begin execution, end execution
- Communication: message creation, size, source/destination
- Hardware counters
Module link, no source code modification
Performance summary data

Automatically Analyze the Performance
Many control points are registered. How to reduce the search space?
Performance Analysis - Identify Program Problems
- Decomposition
- Mapping
- Scheduling

Decomposition Characteristics
[Diagram: decision tree rooted at "Decomposition problem?". Symptoms shown: (1) too big entry method; (2) too big single object; (3) too much critical path; (4) too few objects per PE; bytes per message low; too much communication on one object; high cache miss rate. Remedies shown: increase grain size; decrease grain size; replicate the objects.]

Mapping Characteristics
[Diagram: decision tree rooted at "Mapping problem?". Symptoms shown: load imbalance; too much communication on one PE; communication time >> LogP model time; too much external communication. Remedies shown: load balancer; remap; topology aware mapping.]

Scheduling Characteristics
[Diagram: decision tree rooted at "Scheduling problem?". Symptom shown: critical tasks are delayed. Remedy shown: prioritize the tasks.]

Other Characteristics
[Diagram: decision tree rooted at "Other problems?". Symptoms shown: bytes per message low; reduction/broadcast; long latency. Remedies shown: aggregate messages; collectives; compress message.]

Correlate Performance with Control Points
[Diagram: decision tree that starts from the performance summary. Top-level checks: CPU utilization > 90%; overhead > 10%; idle > 10%. Second-level questions: sequential performance? small bytes per message? small entry methods? decomposition problem? mapping problem? scheduling problem? others? Lower levels list symptoms (longer entry method, larger single object, long critical path, few objects per PE, large communication on one object, load imbalance, communication time >> model time, critical tasks are delayed, cache miss > 10%, large bytes per message, long reduction/broadcast, large external communication, large communication on one PE, long latency) and remedies (increase or decrease aggregation threshold, increase or decrease grain size, replicate objects, load balancer, remap, topology aware mapping, prioritize the tasks, collectives, compress message).]
One box can have multiple children. One egg can have multiple parents.

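The first level of this tree can be sketched in a few lines. This is an illustration, not the PICS implementation: the three thresholds come from the slide, but grouping the second-level questions under each threshold is an assumed reading of the flattened diagram, and `PerfSummary` and `topLevelBranches` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary record; each field is a fraction of total time in [0, 1].
struct PerfSummary {
    double cpuUtilization;
    double overhead;
    double idle;
};

// Return the second-level questions worth exploring for a given summary.
// Thresholds (90%, 10%, 10%) are from the slide; the grouping is an assumption.
std::vector<std::string> topLevelBranches(const PerfSummary& s) {
    assert(s.cpuUtilization >= 0.0 && s.cpuUtilization <= 1.0);
    assert(s.overhead >= 0.0 && s.overhead <= 1.0);
    assert(s.idle >= 0.0 && s.idle <= 1.0);

    std::vector<std::string> branches;
    if (s.cpuUtilization > 0.90) {
        branches.push_back("sequential performance");
    }
    if (s.overhead > 0.10) {
        branches.push_back("small bytes per message");
        branches.push_back("small entry methods");
    }
    if (s.idle > 0.10) {
        branches.push_back("decomposition problem");
        branches.push_back("mapping problem");
        branches.push_back("scheduling problem");
    }
    return branches;
}
```

Pruning at this level is what shrinks the search space: only the control points whose effects match a selected branch need to be tuned.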
Correlate Performance with Control Points
Traverse the tree using the performance summary results
- performance results ⇒ solutions
- solution ⇒ effect of control points
Which control points to tune, and in which direction?
How much? For grain size: MaxObjLoad / AvgLoad
Feed results into the control points database

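The grain-size quantity on this slide reads most naturally as the ratio of the maximum object load to the average object load. A minimal sketch of that computation follows, assuming per-object loads are available as a list of positive numbers. `loadRatio` is a hypothetical helper, and how PICS turns the ratio into a new grain size is not stated on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: ratio of the heaviest object's load to the mean load.
// A ratio near 1 means objects are evenly sized; a large ratio means one
// object dominates, which hints at how far the grain size is off.
double loadRatio(const std::vector<double>& objectLoads) {
    assert(!objectLoads.empty());
    const double maxLoad =
        *std::max_element(objectLoads.begin(), objectLoads.end());
    const double avgLoad =
        std::accumulate(objectLoads.begin(), objectLoads.end(), 0.0) /
        static_cast<double>(objectLoads.size());
    assert(avgLoad > 0.0);
    return maxLoad / avgLoad;
}
```

For loads {1, 1, 1, 5} the maximum is 5 and the average is 2, so the ratio is 2.5.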
Control System APIs
Implemented in Charm++: over-decomposition, asynchronous, message-driven model. (http://charm.cs.uiuc.edu/)

    typedef struct ControlPoint_t {
        char name[30];
        enum TP_DATATYPE datatype;
        double defaultValue;
        double currentValue;
        double minValue;
        double maxValue;
        double bestValue;
        double moveUnit;
        int moveOP;
        int effect;
        int effectDirection;
        int strategy;
        int entryEP;
        int objectID;
    } ControlPoint;

APIs for applications

    void registerControlPoint(ControlPoint *tp);
    void startStep();
    void endStep();
    double getTunedParameter(const char *name, bool *valid);

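A sketch of how an application might call these four functions. Everything in namespace `mock` is a stand-in written for this example so that it compiles on its own: the struct is a trimmed copy of the fields shown on the slides, and the function bodies are assumptions, not the PICS implementation. The control point name `pipelineMessages`, its range, and `runApplication` are hypothetical, as is the assumption that `startStep` and `endStep` bracket one application step.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

namespace mock {  // stand-ins for illustration, NOT the real PICS code

struct ControlPoint {  // trimmed copy of the struct shown on the slides
    char name[30];
    double defaultValue;
    double currentValue;
    double minValue;
    double maxValue;
};

static std::map<std::string, ControlPoint> registry;
static int stepsCompleted = 0;

// Assumed behavior: a new control point starts at its default value.
void registerControlPoint(ControlPoint* tp) {
    tp->currentValue = tp->defaultValue;
    registry[tp->name] = *tp;
}

void startStep() {}                    // a real system would start measuring here
void endStep() { ++stepsCompleted; }   // a real system would analyze and retune here

double getTunedParameter(const char* name, bool* valid) {
    auto it = registry.find(name);
    *valid = (it != registry.end());
    return *valid ? it->second.currentValue : 0.0;
}

}  // namespace mock

// Hypothetical application loop: register once, then read the tuned value
// at the start of every step and reconfigure accordingly.
int runApplication(int steps) {
    mock::ControlPoint cp{};
    std::strncpy(cp.name, "pipelineMessages", sizeof(cp.name) - 1);
    cp.defaultValue = 4;
    cp.minValue = 1;
    cp.maxValue = 64;
    mock::registerControlPoint(&cp);

    int pipelineMessages = 0;
    for (int i = 0; i < steps; ++i) {
        mock::startStep();
        bool valid = false;
        const double v = mock::getTunedParameter("pipelineMessages", &valid);
        if (valid) pipelineMessages = static_cast<int>(v);
        // ... one step of work using `pipelineMessages` messages ...
        mock::endStep();
    }
    return pipelineMessages;
}
```

The `valid` flag lets the application fall back to its own setting when the control system has no value for the requested name.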
Experimental Results of Benchmarks and Applications
1. Control points
2. Performance problems
3. Blue Gene/Q machine, Cray XE6 machine

Tuning Message Pipeline
Control point: number of pipeline messages
[Figure: Tuning the number of pipeline messages. x-axis: step (0 to 80); y-axes: timestep (ms/step) and number of pipeline messages; series: timestep (less work), timestep (more work), pipeline (less work), pipeline (more work)]