
Hierarchy Aware Blocking and Nonblocking Collective Communications



  1. Hierarchy Aware Blocking and Nonblocking Collective Communications: The Effects of Shared Memory Communications in the Cray XT Environment
     Richard L. Graham, Joshua S. Ladd, Manjunath Venkata
     Managed by UT-Battelle for the Department of Energy
     Graham_CAC_2010

  2. Acknowledgements
     • US Department of Energy FASTOS program

  3. Outline
     • Statement of the problem
     • Design Overview
     • Results
     • Next steps

  4. Problems being addressed
     • Optimization of collective operations
     • Implementation of extensible optimized collective operations
     • Implementation of nonblocking collective operations

  5. Why Optimize Collective Communications
     • Collective operations limit application scalability
     • Communication pattern involving multiple processes (in MPI, all ranks in the communicator are involved)
     • Optimized collectives involve a communicator-wide data-dependent communication pattern
     • Data needs to be manipulated at intermediate stages of a collective operation
     • Collective operations magnify the effects of system noise
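
The point about system noise can be illustrated with a toy model. This is an assumption made for illustration and is not taken from the slides: if a synchronizing collective cannot complete until every rank has arrived, its completion time is the maximum over ranks of compute time plus noise delay, so a delay on a single rank is paid by all of them. The function name and the timing values are hypothetical.

```python
def collective_completion_time(work, noise):
    """Earliest time at which a synchronizing collective can complete.

    work[i]  : compute time of rank i before it enters the collective
    noise[i] : extra delay on rank i caused by system noise (made-up values)
    """
    if len(work) != len(noise):
        raise ValueError("work and noise need one entry per rank")
    # Every rank must arrive, so the slowest rank sets the completion time.
    return max(w + n for w, n in zip(work, noise))


# Four ranks with identical work; in the second case only rank 2 sees noise.
work = [10.0, 10.0, 10.0, 10.0]
quiet = collective_completion_time(work, [0.0, 0.0, 0.0, 0.0])
noisy = collective_completion_time(work, [0.0, 0.0, 3.0, 0.0])
print(quiet, noisy)
```

In this model the three ranks that saw no noise still wait for the delayed one, which is the sense in which a collective magnifies noise.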

  6. Scalability of Collective Operations
     [Figure: two panels, "Ideal Algorithm" and "Impact of System Noise"; time versus process rank; legend: Work, Reduction, Communication, Use Result, Noise]

  7. Scalability of Collective Operations - II
     [Figure: two panels, "Offloaded Algorithm" and "Nonblocking Algorithm"; time versus process rank; legend includes Work, Reduction, Communication, Use Result]

  8. Mapping the collectives onto the system
     • Consider communication hierarchies
     • Schedule the network

  9. Example – 4 Process Recursive Doubling
     [Diagram: processes 1 to 4 across Host 1 and Host 2, shown at Step 1 and Step 2, with inter-host communication marked]
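
The recursive doubling pattern named on this slide can be sketched as a small simulation. It is a minimal sketch under stated assumptions, not the authors' implementation: ranks are numbered from 0 rather than 1, the operation is a sum all-reduce, and the placement of two ranks per host (`host_of`) is assumed for illustration. All function names are hypothetical.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum all-reduce over len(values) ranks (a power of two).

    Returns (final_values, steps); steps[k] lists the (rank, partner)
    pairs that exchange data in step k.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("rank count must be a power of two")
    current = list(values)
    steps = []
    distance = 1
    while distance < n:
        # Each rank pairs with the rank whose index differs in one bit.
        pairs = [(r, r ^ distance) for r in range(n) if r < (r ^ distance)]
        updated = list(current)
        for a, b in pairs:
            updated[a] = updated[b] = current[a] + current[b]
        current = updated
        steps.append(pairs)
        distance *= 2
    return current, steps


def inter_host_exchanges(steps, host_of):
    """Count pairwise exchanges whose two ranks sit on different hosts."""
    return sum(1 for pairs in steps for a, b in pairs if host_of[a] != host_of[b])


final, steps = recursive_doubling_allreduce([1, 2, 3, 4])
host_of = [0, 0, 1, 1]  # assumed placement: two ranks per host
print(final, steps, inter_host_exchanges(steps, host_of))
```

With this assumed placement, the first step stays on host and both pairs of the second step cross the network.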

  10. Example – 4 Process Recursive Doubling – On host optimization
      [Diagram: processes 1 to 4 across Host 1 and Host 2, shown at Step 1, Step 2, and Step 3, with inter-host communication marked]
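
One plausible reading of the three steps on this slide is: reduce on each host to a leader, exchange between the leaders only, then distribute the result on each host. The simulation below is a minimal sketch of that reading, not the authors' code. It assumes a sum all-reduce, 0-based ranks, the lowest rank on each host acting as leader, and a power-of-two host count; `hierarchical_allreduce` and `host_of` are hypothetical names.

```python
def hierarchical_allreduce(values, host_of):
    """Simulate a three-phase sum all-reduce.

    Returns (result_per_rank, inter_host_exchange_count).
    """
    # Group ranks by host; the first (lowest) rank on each host is its leader.
    ranks_on = {}
    for rank, host in enumerate(host_of):
        ranks_on.setdefault(host, []).append(rank)
    hosts = sorted(ranks_on)
    n_hosts = len(hosts)
    if n_hosts & (n_hosts - 1):
        raise ValueError("host count must be a power of two")

    # Step 1 (on host): reduce each host's values onto its leader.
    partial = [sum(values[r] for r in ranks_on[h]) for h in hosts]

    # Step 2 (inter host): recursive doubling among the leaders only.
    inter_host = 0
    distance = 1
    while distance < n_hosts:
        updated = list(partial)
        for i in range(n_hosts):
            j = i ^ distance
            if i < j:
                updated[i] = updated[j] = partial[i] + partial[j]
                inter_host += 1
        partial = updated
        distance *= 2

    # Step 3 (on host): each leader hands the result to the ranks on its host.
    result = [None] * len(values)
    for idx, h in enumerate(hosts):
        for r in ranks_on[h]:
            result[r] = partial[idx]
    return result, inter_host


result, crossings = hierarchical_allreduce([1, 2, 3, 4], [0, 0, 1, 1])
print(result, crossings)
```

In this model only one pairwise exchange crosses the network. A flat recursive doubling over the same four ranks, with two ranks per host, would send both pairs of its second step across hosts.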

  11. Design strategy
      • Decouple
        – Hierarchy detection
        – Network specific collective algorithm implementation (“single” level)
        – Full collective function implementation (hierarchical)
        – Basic building blocks from MPI level functions
      • Share resources between levels without breaking the abstraction between layers
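
To make the "hierarchy detection" item concrete, the sketch below groups ranks into subgroups by location, independent of any collective algorithm that later runs on those subgroups. It is an illustration only: the function name, the `(host, socket)` input format, and the choice of the lowest rank as the per-host representative are assumptions, and this is not the Open MPI subgroup interface.

```python
from collections import defaultdict


def detect_hierarchy(locations):
    """Build subgroups from per-rank (host, socket) locations.

    Returns (socket_groups, host_groups, network_group):
      socket_groups : ranks sharing a socket, keyed by (host, socket)
      host_groups   : ranks sharing a host, keyed by host
      network_group : one representative rank per host
    """
    socket_groups = defaultdict(list)
    host_groups = defaultdict(list)
    for rank, (host, socket) in enumerate(locations):
        socket_groups[(host, socket)].append(rank)
        host_groups[host].append(rank)
    # Ranks are appended in ascending order, so ranks[0] is the lowest rank.
    network_group = sorted(ranks[0] for ranks in host_groups.values())
    return dict(socket_groups), dict(host_groups), network_group


# Hypothetical placement of six ranks on two hosts with two sockets each.
locations = [
    ("nodeA", 0), ("nodeA", 0), ("nodeA", 1),
    ("nodeB", 0), ("nodeB", 1), ("nodeB", 1),
]
sockets, hosts, network = detect_hierarchy(locations)
print(sockets, hosts, network)
```

Keeping detection separate in this way means a single-level algorithm can be written once per communication mechanism and then applied to whichever subgroup it suits.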

  12. Collectives – Software Layers
      [Diagram labels: OMPI Module Component Architecture; Collective Framework; Basic Collectives (bcol) Framework; Subgroup Framework; SM; NUMA; MUMA; IBNET; Pt2Pt; ML – Hierarchical Collectives Comp.; Tuned (pt2pt) Collectives Comp.; IB OFFLOAD; MLNX OFED]

  13. Benchmarks

  14. System setup
      • Jaguar
        – 2.6 GHz Istanbul processor
        – Dual socket
        – Hex-core
      • Smoky
        – 2.0 GHz Opteron
        – Quad socket
        – Quad core

  15. Barrier as a function of Process count – Jaguar – 2 Level hierarchy
      [Plot: latency of the barrier (usecs) versus number of processes (2 to 12); series: Shared Memory, pt-2-pt]

  16. Barrier as a function of Process count – Smoky – 2 Level hierarchy
      [Plot: latency of the barrier (usecs) versus number of processes (2 to 16); series: Shared Memory, pt-2-pt]

  17. Barrier As a function of number of sockets - Jaguar
      [Plot: latency of the barrier (usecs) for 2 and 4 processes; series: Processes on Same Socket, Processes on Different Sockets]

  18. Barrier As a function of number of sockets (1,2) – Smoky
      [Plot: latency of the barrier (usecs) for 2 and 4 processes; series: Processes on Same Socket, Processes on Different Sockets]

  19. Barrier As a function of number of sockets (1,4) – Smoky
      [Plot: latency of the barrier (usecs) for 4 processes; series: Message Traffic within Socket, Message Traffic between Sockets]

  20. Summary
      • Added hardware support for offloading collective operations
      • Developed MPI-level support for asynchronous collectives
      • Good barrier performance
      • Good overlap capabilities
      • Work is continuing
