Hierarchy Aware Blocking and Nonblocking Collective Communications: The Effects of Shared Memory Communications in the Cray XT Environment
Richard L. Graham, Joshua S. Ladd, Manjunath Venkata
Managed by UT-Battelle for the Department of Energy
Graham_CAC_2010
Acknowledgements
• US Department of Energy FASTOS program
Outline
• Statement of the problem
• Design Overview
• Results
• Next steps
Problems being addressed
• Optimization of collective operations
• Implementation of extensible optimized collective operations
• Implementation of nonblocking collective operations
Why Optimize Collective Communications
• Collective operations limit application scalability
• Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
• Optimized collectives involve a communicator-wide data-dependent communication pattern
• Data needs to be manipulated at intermediate stages of a collective operation
• Collective operations magnify the effects of system noise
Scalability of Collective Operations
[Figure: two timelines, "Ideal Algorithm" and "Impact of System Noise". Axes: Time and Process Rank. Legend: Work, Reduction, Communication, Use Result, Noise.]
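The noise effect named on this slide can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the authors' measurement method: every rank does the same amount of work, each rank is independently delayed with a small probability, and the collective finishes only when the slowest rank arrives. The function names and the parameters `noise_prob` and `noise_cost` are hypothetical.

```python
import random


def collective_time(num_ranks, rng, work=1.0, noise_prob=0.01, noise_cost=0.5):
    """Time of one compute-then-synchronize step (toy model).

    Each rank needs `work` time units and, with probability `noise_prob`,
    loses an extra `noise_cost` to system noise. The step ends when the
    slowest rank is done.
    """
    per_rank = [
        work + (noise_cost if rng.random() < noise_prob else 0.0)
        for _ in range(num_ranks)
    ]
    return max(per_rank)


def mean_step_time(num_ranks, trials=500, seed=0):
    """Average step time over many trials, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    total = sum(collective_time(num_ranks, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 16, 256):
        print(n, round(mean_step_time(n), 3))
```

In this model a single rank is rarely delayed, but the chance that at least one of many ranks is delayed grows quickly with the rank count, so the average step time drifts toward `work + noise_cost`.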
Scalability of Collective Operations - II
[Figure: two timelines, "Offloaded Algorithm" and "Nonblocking Algorithm". Axes: Time and Process Rank. Legend: Work, Reduction, Communication, Use Result, Delegation Agent.]
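The start / overlap / wait control flow of a nonblocking collective can be sketched in a few lines. This is only an illustration of the pattern, assuming a single worker thread as a stand-in for whatever agent progresses the collective. It is not the implementation described in these slides, and Python threads do not give true parallel speedup for CPU-bound work. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_sum(contributions):
    """Stand-in for the collective: combine one value per rank."""
    return sum(contributions)


def independent_work(n):
    """Computation that does not depend on the collective's result."""
    return sum(i * i for i in range(n))


def overlapped_step(contributions, n):
    """Start the collective, do unrelated work, then wait for the result."""
    # One worker thread plays the role of the agent progressing the collective.
    with ThreadPoolExecutor(max_workers=1) as agent:
        request = agent.submit(reduce_sum, contributions)  # start: returns at once
        local = independent_work(n)                        # overlap with the collective
        total = request.result()                           # wait: blocks until complete
    return total, local


if __name__ == "__main__":
    print(overlapped_step([1, 2, 3, 4], 10))
```

The caller only blocks at the wait, so any delay inside the collective can be hidden behind the independent work as long as that work takes at least as long.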
Mapping the collectives onto the system
• Consider communication hierarchies
• Schedule the network
Example – 4 Process Recursive Doubling
[Figure: processes 1 and 2 on Host 1, processes 3 and 4 on Host 2, shown across Step 1 and Step 2; labeled "Inter Host Communication".]
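Recursive doubling can be sketched as follows. This is a minimal single-process simulation, assuming a power-of-two process count and a sum reduction. Ranks are numbered from 0 here, whereas the slide numbers processes from 1. The function names are hypothetical.

```python
def recursive_doubling_schedule(num_procs):
    """Return one list of (rank, partner) exchanges per step.

    At the step with distance d, each rank exchanges with rank XOR d.
    Requires a power-of-two number of processes.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of two")
    schedule = []
    distance = 1
    while distance < num_procs:
        schedule.append([(rank, rank ^ distance) for rank in range(num_procs)])
        distance *= 2
    return schedule


def allreduce_sum(values):
    """Simulate a sum allreduce: after log2(P) steps every rank holds the total."""
    values = list(values)
    for step in recursive_doubling_schedule(len(values)):
        # All exchanges in a step read the values from before the step.
        values = [values[rank] + values[partner] for rank, partner in step]
    return values


if __name__ == "__main__":
    for number, step in enumerate(recursive_doubling_schedule(4), start=1):
        print("step", number, step)
    print(allreduce_sum([1, 2, 3, 4]))
```

With two ranks per host and ranks placed in order, the distance-1 step stays on the host, and in the distance-2 step every rank exchanges with a rank on the other host.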
Example – 4 Process Recursive Doubling – On host optimization
[Figure: processes 1 and 2 on Host 1, processes 3 and 4 on Host 2, shown across Step 1, Step 2, and Step 3; labeled "Inter Host Communication".]
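One way to read the three-step diagram is: combine on each host, exchange between one leader per host, then distribute on each host. The sketch below simulates that reading for a sum reduction. It assumes ranks are placed on hosts in contiguous blocks, the first rank on each host acts as leader, and the host count is a power of two. These are illustrative assumptions, and the function name is hypothetical.

```python
def hierarchical_allreduce_sum(values, procs_per_host):
    """Simulate a three-phase allreduce; return (results, inter-host message count)."""
    num_procs = len(values)
    if procs_per_host < 1 or num_procs % procs_per_host:
        raise ValueError("process count must be a multiple of procs_per_host")
    hosts = [
        list(range(start, start + procs_per_host))
        for start in range(0, num_procs, procs_per_host)
    ]
    num_hosts = len(hosts)
    if num_hosts & (num_hosts - 1):
        raise ValueError("number of hosts must be a power of two")

    # Phase 1: on-host fan-in; each leader holds its host's partial sum.
    partial = [sum(values[rank] for rank in host) for host in hosts]

    # Phase 2: recursive doubling among the leaders only.
    inter_host_messages = 0
    distance = 1
    while distance < num_hosts:
        partial = [partial[h] + partial[h ^ distance] for h in range(num_hosts)]
        inter_host_messages += num_hosts  # one message sent per leader per step
        distance *= 2

    # Phase 3: on-host fan-out; each leader shares the total with its host.
    results = [0] * num_procs
    for h, host in enumerate(hosts):
        for rank in host:
            results[rank] = partial[h]
    return results, inter_host_messages


if __name__ == "__main__":
    print(hierarchical_allreduce_sum([1, 2, 3, 4], procs_per_host=2))
```

For four processes on two hosts, this model sends two inter-host messages, whereas flat recursive doubling has all four ranks send across hosts in its second step. The cost is one extra on-host step.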
Design strategy
• Decouple
  – Hierarchy detection
  – Network specific collective algorithm implementation (“single” level)
  – Full collective function implementation (hierarchical)
  – Basic building blocks from MPI level functions
• Share resources between levels without breaking the abstraction between layers
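To make the "hierarchy detection" step concrete, here is a small sketch that groups ranks into subgroups by socket and host and picks a leader for each. It is an illustration under stated assumptions only: the placement map is given as input, the lowest rank in a group is its leader, and the function and key names are hypothetical rather than taken from the software described in these slides.

```python
from collections import defaultdict


def build_hierarchy(placement):
    """Build subgroups from a map of rank -> (host, socket).

    Returns a dict with three levels:
      "socket":  ranks sharing a socket,
      "host":    socket leaders sharing a host,
      "network": host leaders, which communicate over the network.
    """
    by_socket = defaultdict(list)
    for rank in sorted(placement):
        by_socket[placement[rank]].append(rank)

    by_host = defaultdict(list)
    for (host, _socket), ranks in sorted(by_socket.items()):
        by_host[host].append(ranks[0])  # lowest rank leads its socket

    network = [leaders[0] for _host, leaders in sorted(by_host.items())]

    return {
        "socket": [by_socket[key] for key in sorted(by_socket)],
        "host": [by_host[key] for key in sorted(by_host)],
        "network": network,
    }


if __name__ == "__main__":
    # Two hosts, two sockets per host, two ranks per socket (made-up layout).
    layout = {rank: ("ab"[rank // 4], (rank // 2) % 2) for rank in range(8)}
    print(build_hierarchy(layout))
```

Because the detection step only produces groups, a collective at each level can be implemented separately and then composed, which is the decoupling the slide lists.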
Collectives – Software Layers
[Figure: layer diagram. Labels: OMPI Modular Component Architecture; Collective Framework; Basic Collectives (bcol) Framework; Subgroup Framework; ML – Hierarchical Collectives Comp.; Tuned (pt2pt) Collectives Comp.; SM; NUMA; MUMA; IBNET; Pt2Pt; IB OFFLOAD; MLNX OFED.]
Benchmarks
System setup
• Jaguar
  – 2.6 GHz Istanbul processor
  – Dual socket
  – Hex-core
• Smoky
  – 2.0 GHz Opteron
  – Quad socket
  – Quad core
Barrier as a function of Process count – Jaguar – 2 Level hierarchy
[Figure: Latency of the Barrier (usecs) vs. Processes (2 to 12). Series: Shared Memory, pt-2-pt.]
Barrier as a function of Process count – Smoky – 2 Level hierarchy
[Figure: Latency of the Barrier (usecs) vs. Processes (2 to 16). Series: Shared Memory, pt-2-pt.]
Barrier as a function of number of sockets - Jaguar
[Figure: Latency of the Barrier (usecs) for 2 and 4 Processes. Series: Processes on Same Socket, Processes on Different Sockets.]
Barrier as a function of number of sockets (1,2) – Smoky
[Figure: Latency of the Barrier (usecs) for 2 and 4 Processes. Series: Processes on Same Socket, Processes on Different Sockets.]
Barrier as a function of number of sockets (1,4) – Smoky
[Figure: Latency of the Barrier (usecs) for 4 Processes. Series: Message Traffic within Socket, Message Traffic between Sockets.]
Summary
• Added hardware support for offloading collective operations
• Developed MPI-level support for asynchronous collectives
• Good barrier performance
• Good overlap capabilities
• Work is continuing