Runtime Optimization of Application Level Communication Patterns
Edgar Gabriel and Shuo Huang
Department of Computer Science, University of Houston
gabriel@cs.uh.edu
HIPS 2007, Long Beach
Motivation
Finite Difference code on a PC cluster using IB and GE interconnects
[Figure: execution time [sec] (axis 0 to 30) for 200 iterations of the solver on 32 processes/processors. Series: fcfs, fcfs-pack, ordered, overlap. Configurations: 128x128x64 IB, 128x128x128 IB, 128x128x64 TCP, 128x128x128 TCP.]
How to implement the required communication pattern efficiently?
• Dependence on platform
  – Some functionality is only supported (efficiently) on certain platforms or with certain network interconnects
• Dependence on MPI library
  – Does the MPI library support all available methods?
  – Efficiency in overlapping communication and computation
  – Quality of the support for user-defined data types
• Dependence on application
  – Problem size
  – Ratio of communication to computation
• Problem: How can an (average) user understand the myriad of implementation options and their impact on the performance of the application?
• (Honest) Answer: no way
  – Abstract interfaces for application level communication operations required → ADCL
  – Statistical tools required to detect correlations between parameters and application performance
ADCL - Adaptive Data and Communication Library
• Goals:
  – Provide abstract interfaces for frequently occurring application level communication patterns
    • Collective operations
    • Not covered by the MPI specification
  – Provide a wide variety of implementation possibilities and decision routines which choose the fastest available implementation (at runtime)
• Not replacing MPI, but add-on functionality
  – Uses many features of MPI
ADCL terminology
ADCL object      Functionality
Attribute        Abstraction for a characteristic of an implementation, represented by the set of its possible values
Attribute-set    Group of attributes
Function         Implementation of a particular operation
                 • optionally including an attribute-set and values
Function-set     Set of functions providing the same functionality
                 • have to have the same attribute-set
Vector           Abstraction for a multi-dimensional data object
Topology         Abstraction for a process topology
Request          Handle for tuple of <topology, vector, function-set>
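The relationships between these objects can be sketched as plain C data structures. This is only an illustration of the terminology, not ADCL's actual internal representation: all type names, field names and the `MAX_ATTRS` limit are hypothetical. The helper checks the one constraint stated in the table, namely that all functions of a function-set share the same attribute-set.

```c
#include <stdbool.h>

#define MAX_ATTRS 8   /* hypothetical upper bound, chosen for this sketch */

/* Attribute: a characteristic of an implementation, given by its possible values */
typedef struct {
    const char *name;
    int         nvalues;
    const int  *values;
} attribute_t;

/* Attribute-set: a group of attributes */
typedef struct {
    int                nattrs;
    const attribute_t *attrs[MAX_ATTRS];
} attrset_t;

/* Function: one implementation, with the attribute values that describe it */
typedef struct {
    const char      *name;
    const attrset_t *attrset;
    int              attrvals[MAX_ATTRS];
} function_t;

/* Function-set: functions providing the same functionality */
typedef struct {
    const char       *name;
    int               nfuncs;
    const function_t *funcs;
} fnctset_t;

/* Returns true if every function in the set refers to the same attribute-set */
bool fnctset_is_valid(const fnctset_t *fs)
{
    for (int i = 1; i < fs->nfuncs; i++) {
        if (fs->funcs[i].attrset != fs->funcs[0].attrset) {
            return false;
        }
    }
    return true;
}
```

Comparing attribute-sets by pointer identity is a simplification made for this sketch; a real library might compare their contents instead.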
Code sample
    ADCL_Vector vec;
    ADCL_Topology topo;
    ADCL_Request request;

    /* Generate a 2-D process topology */
    MPI_Cart_create(comm, 2, cart_dims, periods, 0, &cart_comm);
    ADCL_Topology_create(cart_comm, &topo);

    /* Register a 2-D vector with ADCL */
    ADCL_Vector_register(ndims, vec_dims, HALO_WIDTH, MPI_DOUBLE, vector, &vec);

    /* Match process topology, data item and function-set */
    ADCL_Request_create(vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request);

    for (i = 0; i < NIT; i++) {
        /* Main application loop */
        ADCL_Request_start(request);
        …
    }
Runtime selection logic: brute force search (I)
[Figure: timeline of the application run. Implementations no. 1 to 7 are tested one after another; the fastest implementation is then used for the rest of the application.]
Runtime selection logic: brute force search (II)
• Test each function of a given function-set a given number of times
  – Store the execution time of each execution per process
• Filter the list of execution times in order to exclude outliers
• Determine the average execution time $f_i^j$ per function i and process j
• Determine the maximum execution time for function i across all processes:
  $f_i^{\max} = \max_j \left( f_i^j \right), \quad j = 0 \dots \mathit{nprocs} - 1$
  – Requires communication (e.g. MPI_Allreduce)
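The per-process part of this procedure (filter the measurements, then average them) could look roughly like the following. The slides do not say which outlier criterion is used, so the rule here, dropping every sample larger than `factor` times the median, is purely an assumption of this sketch, as is the function name `filtered_average`.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles (ascending) */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Average of all samples that are not considered outliers.
 * ASSUMPTION: a sample is an outlier if it exceeds factor * median.
 * Returns -1.0 if there are no samples, no sample survives the filter,
 * or memory allocation fails.
 */
double filtered_average(const double *samples, int n, double factor)
{
    if (samples == NULL || n <= 0) {
        return -1.0;
    }

    /* Sort a copy so the caller's measurement order is preserved */
    double *sorted = malloc((size_t)n * sizeof *sorted);
    if (sorted == NULL) {
        return -1.0;
    }
    memcpy(sorted, samples, (size_t)n * sizeof *sorted);
    qsort(sorted, (size_t)n, sizeof *sorted, cmp_double);

    double median = (n % 2 == 1)
        ? sorted[n / 2]
        : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    free(sorted);

    /* Accumulate only the samples that pass the filter */
    double sum = 0.0;
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (samples[i] <= factor * median) {
            sum += samples[i];
            kept++;
        }
    }
    return (kept > 0) ? sum / kept : -1.0;
}
```

In an MPI program, each process would compute this value locally for every function; the maximum across processes could then be obtained with `MPI_Allreduce` using the `MPI_MAX` operation.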
Runtime selection logic: brute force search (III)
• Determine the function with the minimal maximum execution time across all processes:
  $f_{\mathit{winner}} = \min_i \left( f_i^{\max} \right), \quad i = 0 \dots \mathit{nfuncs} - 1$
• Use this function for the rest of the application lifetime
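The min-of-max decision can be illustrated without MPI by assuming that all per-process averages are already available in one array. The row-major layout `avg[i * nprocs + j]` and the name `select_winner` are assumptions of this sketch; in a real run the inner maximum would come from a reduction across processes rather than a local loop.

```c
#include <stddef.h>

/*
 * avg[i * nprocs + j] is the average execution time of function i on
 * process j (layout assumed for this sketch).
 * Returns the index of the function whose slowest process is fastest,
 * or -1 on invalid input.
 */
int select_winner(const double *avg, int nfuncs, int nprocs)
{
    if (avg == NULL || nfuncs <= 0 || nprocs <= 0) {
        return -1;
    }

    int winner = -1;
    double best = 0.0;

    for (int i = 0; i < nfuncs; i++) {
        /* Maximum over all processes for function i */
        double fmax = avg[(size_t)i * nprocs];
        for (int j = 1; j < nprocs; j++) {
            double t = avg[(size_t)i * nprocs + j];
            if (t > fmax) {
                fmax = t;
            }
        }
        /* Keep the function with the smallest maximum seen so far */
        if (winner < 0 || fmax < best) {
            best = fmax;
            winner = i;
        }
    }
    return winner;
}
```

Taking the maximum first means that a function which is very fast on one process but slow on another does not win: the slowest process determines how long the whole communication step takes.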
Runtime selection logic: performance hypothesis (I)
• Assumptions:
  – Every implementation can be characterized by a set of attributes which impact its performance, e.g. for neighborhood communication:
    • Communication pattern/degree
    • Handling of non-contiguous data
    • Data transfer primitive
    • Overlapping communication and computation
  – The fastest implementation will also have the optimal values for these attributes
Runtime selection logic: performance hypothesis (II)
• Approach: determine the optimal value for an attribute by comparing the execution time of functions differing in only a single attribute

                        Function a  Function b  Function c
Value for attribute 1   1           2           3
Value for attribute 2   X           X           X
Value for attribute 3   Y           Y           Y
Value for attribute 4   z           z           z

  – E.g. if function c had the lowest execution time across all processes:
    • Hypothesis: value 3 is optimal for attribute 1
    • Confidence value in this hypothesis: 1
Runtime selection logic: performance hypothesis (III)
• Evaluate a different set of functions differing in one other attribute, e.g.

                        Function c  Function d  Function e
Value for attribute 1   1           2           3
Value for attribute 2   X+1         X+1         X+1
Value for attribute 3   Y           Y           Y
Value for attribute 4   z           z           z

  – If this set of measurements leads to the same optimal value for attribute 1:
    • Increase the confidence value for this hypothesis by 1
  – Else decrease the confidence value by 1
Runtime selection logic: performance hypothesis (IV)
• If the confidence value for an attribute reaches a given threshold
  – Remove all functions not having the required value for this attribute from the function-set
• If the values for the attribute(s) do not converge, this algorithm falls back to the brute force search
• Advantage: potentially fewer functions have to be evaluated to determine the winner
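The confidence bookkeeping and the pruning step can be sketched as follows. The types, the fixed `NATTRS` and the function names are hypothetical and do not reflect ADCL's implementation. One detail is an assumption of this sketch rather than something stated on the slides: when the confidence drops back to 0, the next observation starts a new hypothesis.

```c
#include <stdbool.h>

#define NATTRS 4   /* number of attributes per function, illustrative value */

typedef struct {
    int  attr[NATTRS];  /* attribute values characterizing this implementation */
    bool active;        /* false once removed from the function-set */
} func_t;

typedef struct {
    int value;       /* attribute value currently assumed to be optimal */
    int confidence;  /* 0 means there is no hypothesis yet */
} hypothesis_t;

/*
 * Update the hypothesis for one attribute with the attribute value of the
 * fastest function in the latest set of measurements.
 */
void hypothesis_update(hypothesis_t *h, int observed)
{
    if (h->confidence == 0) {
        /* First observation (or hypothesis fully retracted): start over */
        h->value = observed;
        h->confidence = 1;
    } else if (h->value == observed) {
        h->confidence++;   /* measurements agree with the hypothesis */
    } else {
        h->confidence--;   /* measurements contradict the hypothesis */
    }
}

/*
 * Once the confidence has reached the threshold, deactivate every function
 * whose value for attribute a differs from the hypothesis.
 * Returns the number of functions that are still active.
 */
int prune(func_t *funcs, int nfuncs, int a, const hypothesis_t *h, int threshold)
{
    int remaining = 0;
    for (int i = 0; i < nfuncs; i++) {
        if (h->confidence >= threshold && funcs[i].attr[a] != h->value) {
            funcs[i].active = false;
        }
        if (funcs[i].active) {
            remaining++;
        }
    }
    return remaining;
}
```

If no hypothesis ever reaches the threshold, `prune` removes nothing and every function ends up being measured, which corresponds to the brute force search.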
Currently available implementations for neighborhood communication

Name                   Comm. pattern  Handling of non-cont. data  Data transfer primitive
IsendIrecv_aao         aao            ddt                         MPI_Isend/Irecv/Waitall
IsendIrecv_pair        pair           ddt                         MPI_Isend/Irecv/Waitall
SendIrecv_aao          aao            ddt                         MPI_Send/Irecv/Waitall
SendIrecv_pair         pair           ddt                         MPI_Send/Irecv/Wait
IsendIrecv_aao_pack    aao            Pack/unpack                 MPI_Isend/Irecv/Waitall
IsendIrecv_pair_pack   pair           Pack/unpack                 MPI_Isend/Irecv/Waitall
SendIrecv_aao_pack     aao            Pack/unpack                 MPI_Send/Irecv/Waitall
SendIrecv_pair_pack    pair           Pack/unpack                 MPI_Send/Irecv/Wait
SendRecv_pair          pair           ddt                         MPI_Send/Recv
Sendrecv_pair          pair           ddt                         MPI_Sendrecv
SendRecv_pair_pack     pair           Pack/unpack                 MPI_Send/Recv
Sendrecv_pair_pack     pair           Pack/unpack                 MPI_Sendrecv
WinfencePut_aao        aao            ddt                         MPI_Put/MPI_Win_fence
WinfenceGet_aao        aao            ddt                         MPI_Get/MPI_Win_fence
PostStartPut_aao       aao            ddt                         MPI_Put/MPI_Win_post/start
PostStartGet_aao       aao            ddt                         MPI_Get/MPI_Win_post/start
WinfencePut_pair       pair           ddt                         MPI_Put/MPI_Win_fence
WinfenceGet_pair       pair           ddt                         MPI_Get/MPI_Win_fence
PostStartPut_pair      pair           ddt                         MPI_Put/MPI_Win_post/start
PostStartGet_pair      pair           ddt                         MPI_Get/MPI_Win_post/start
Performance results (I)
InfiniBand, 32 processes, small problem size
[Figure: bar chart of execution time [sec] (axis 10.4 to 12.4) for each neighborhood communication implementation and for the brute force (brute) and performance hypothesis (hypo) runtime selection.]
Performance results (II)
InfiniBand, 32 processes, large problem size
[Figure: bar chart of execution time [sec] (axis 72.5 to 77.5) for each neighborhood communication implementation and for the brute force (brute) and performance hypothesis (hypo) runtime selection.]
Performance results (III)
TCP over Fast Ethernet, 32 processes, small problem size
[Figure: bar chart of execution time [sec] (axis 0 to 400). Bars: IsendIrecv_aao, IsendIrecv_pair, SendIrecv_aao, SendIrecv_pair, SendRecv_pair, Sendrecv_pair, IsendIrecv_aao_pack, IsendIrecv_pair_pack, SendIrecv_aao_pack, SendIrecv_pair_pack, SendRecv_pair_pack, Sendrecv_pair_pack, brute, hypo.]