
Collective Framework and Performance Optimization to Open MPI for Cray XT 5 Platforms (Cray Users Group 2011)


  1. Collective Framework and Performance Optimization to Open MPI for Cray XT 5 Platforms. Cray Users Group 2011. Managed by UT-Battelle for the Department of Energy.

  2. Collectives are Critical for HPC Application Performance
     • A large percentage of application execution time is spent in global synchronization operations (collectives)
     • As systems move toward exascale (millions of processor cores), the time spent in collectives only increases
     • The performance and scalability of HPC applications require efficient and scalable collective operations

  3. Weaknesses in the Current Open MPI Implementation
     Open MPI lacks support for:
     • Customized collective implementations for arbitrary communication hierarchies
     • Concurrent progress of collectives on different communication hierarchies
     • Nonblocking collectives
     • Taking advantage of the capabilities of recent network interfaces (for example, offload capabilities)
     • An efficient point-to-point message protocol for Cray XT platforms

  4. Cheetah: A Framework for Scalable Hierarchical Collectives
     Goals of the framework:
     • Provide building blocks for implementing collectives for arbitrary communication hierarchies
     • Support collectives tailored to the communication hierarchy
     • Support both blocking and nonblocking collectives efficiently
     • Enable building collectives customized for the hardware architecture

  5. Cheetah Framework: Design Principles
     • A collective operation is split into collective primitives over the different communication hierarchies
     • Collective primitives over the different hierarchies are allowed to progress concurrently
     • The topology of a collective operation is decoupled from its implementation, enabling the reuse of primitives
     • Design decisions are driven by the nonblocking collective design; blocking collectives are a special case of nonblocking ones (see the sketch after this list)
     • Uses the Open MPI component architecture
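The "blocking collectives are a special case of nonblocking ones" principle can be illustrated with standard MPI-3 nonblocking collectives, which postdate this 2011 work and are used here only as a stand-in for Cheetah's own nonblocking primitives: a blocking barrier is simply a nonblocking barrier driven to completion.

```c
/* Illustration only (standard MPI-3, not the Cheetah API): a blocking
 * barrier built as a nonblocking barrier plus explicit progress. */
#include <mpi.h>

static void blocking_barrier_from_nonblocking(MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibarrier(comm, &req);                          /* start the nonblocking collective */
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);      /* drive progress until it completes */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    blocking_barrier_from_nonblocking(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```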

  6. Cheetah is Implemented as a Part of Open MPI
     [Component diagram: the Cheetah components (ML, BCOL, SBGP, and their BASESOCKET, BASESMUMA, IBOFFLOAD, PTPCOLL, IBNET, and P2P modules) plug into the existing Open MPI components (OMPI, COLL, DEFAULT).]

  7. Cheetah Components and their Functions
     • Base Collectives (BCOL): implements the basic collective primitives
     • Subgrouping (SBGP): provides the rules for grouping the processes
     • Multilevel (ML): coordinates collective primitive execution, manages data and control buffers, and maps MPI semantics to BCOL primitives
     • Schedule: defines the collective primitives that make up a collective operation
     • Progress Engine: responsible for starting, progressing, and completing the collective primitives (see the sketch below)
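A minimal, self-contained sketch of the schedule/progress-engine idea described above. All names here (coll_task_t, coll_schedule_t, dummy_primitive, and so on) are illustrative placeholders, not the actual Cheetah data structures.

```c
/* Illustrative sketch only: a "schedule" is an ordered list of collective
 * primitives, and a "progress engine" repeatedly polls each primitive
 * until the whole operation completes. */
#include <stdio.h>
#include <stdbool.h>

typedef bool (*coll_primitive_fn)(void *ctx);   /* returns true when the primitive is done */

typedef struct {
    const char       *name;        /* e.g. "intra-socket fan-in", "inter-node exchange" */
    coll_primitive_fn progress;    /* called repeatedly by the progress engine */
    bool              done;
} coll_task_t;

typedef struct {
    coll_task_t *tasks;
    int          ntasks;
} coll_schedule_t;

/* Dummy primitive: completes after a few polls, standing in for real
 * shared-memory or network-level collective primitives. */
static bool dummy_primitive(void *ctx)
{
    int *polls_left = ctx;
    return --(*polls_left) <= 0;
}

/* Progress engine: drives every task in the schedule to completion,
 * allowing tasks on different hierarchies to make progress concurrently. */
static void progress_engine_run(coll_schedule_t *sched, void *ctx)
{
    int remaining = sched->ntasks;
    while (remaining > 0)
        for (int i = 0; i < sched->ntasks; i++)
            if (!sched->tasks[i].done && sched->tasks[i].progress(ctx)) {
                sched->tasks[i].done = true;
                printf("completed: %s\n", sched->tasks[i].name);
                remaining--;
            }
}

int main(void)
{
    int budget = 6;   /* shared dummy state for the placeholder primitives */
    coll_task_t tasks[] = {
        { "intra-socket fan-in",  dummy_primitive, false },
        { "inter-node exchange",  dummy_primitive, false },
        { "intra-socket fan-out", dummy_primitive, false },
    };
    coll_schedule_t sched = { tasks, 3 };
    progress_engine_run(&sched, &budget);
    return 0;
}
```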

  8. BCOL Component: Base Collective Primitives
     • Provides collective primitives that are optimized for particular communication hierarchies:
       – BASESMUMA: shared memory
       – P2P: SeaStar 2+, Ethernet, InfiniBand
       – IBNET: ConnectX-2
     • A collective operation is implemented as a combination of these primitives; for example, an n-level Barrier can be a combination of Fan-in (first n-1 levels), Barrier (n-th level), and Fan-out (first n-1 levels), as sketched below
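A small sketch of the n-level barrier example above: it only builds the ordered list of primitives (fan-in up the hierarchy, a barrier at the top level, fan-out back down). The enum and struct names are placeholders, not BCOL types.

```c
/* Illustrative sketch: the task list of an n-level barrier composed of
 * fan-in, barrier, and fan-out primitives over the hierarchy levels. */
#include <stdio.h>

typedef enum { FANIN, BARRIER, FANOUT } prim_kind_t;

typedef struct {
    prim_kind_t kind;
    int         level;   /* which communication hierarchy level the primitive runs on */
} barrier_task_t;

/* Fill `tasks` (capacity >= 2*n - 1) with the primitives of an n-level
 * barrier; returns the number of tasks written. */
static int build_barrier_schedule(barrier_task_t *tasks, int n)
{
    int t = 0;
    for (int lvl = 0; lvl < n - 1; lvl++)            /* fan-in over the first n-1 levels */
        tasks[t++] = (barrier_task_t){ FANIN, lvl };
    tasks[t++] = (barrier_task_t){ BARRIER, n - 1 }; /* barrier at the n-th level */
    for (int lvl = n - 2; lvl >= 0; lvl--)           /* fan-out back over the first n-1 levels */
        tasks[t++] = (barrier_task_t){ FANOUT, lvl };
    return t;
}

int main(void)
{
    barrier_task_t tasks[2 * 3 - 1];
    int count = build_barrier_schedule(tasks, 3);    /* e.g. socket, node, inter-node */
    for (int i = 0; i < count; i++)
        printf("step %d: %s at level %d\n", i,
               tasks[i].kind == FANIN ? "fan-in" :
               tasks[i].kind == BARRIER ? "barrier" : "fan-out",
               tasks[i].level);
    return 0;
}
```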

  9. SBGP Component: Group the Processes Based on the Communication Hierarchy
     [Figure: two nodes, each with two CPU sockets of allocated and unallocated cores; processes form socket subgroups with socket group leaders, a UMA subgroup with a UMA group leader on each node, and a P2P subgroup connecting the leaders across nodes.] A subgrouping sketch using standard MPI communicators follows below.
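The subgrouping idea can be illustrated with standard MPI communicator splitting. This is not the SBGP implementation itself, and MPI_Comm_split_type is an MPI-3 feature that postdates this work; it is used here only to show the node-level and leader-level grouping.

```c
/* Illustration only: build a per-node subgroup and a leaders-only subgroup,
 * mimicking SBGP-style grouping with plain MPI calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes that share a node (shared memory) form one subgroup. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 of each node acts as the group leader; the leaders form the
     * inter-node (P2P) subgroup, everyone else gets MPI_COMM_NULL. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {
        printf("world rank %d is a node leader\n", world_rank);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```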

  10. Open MPI Portals BTL Optimization
     [Figure: a sender and a receiver MPI process; the MPI message is carried as an Open MPI message inside a Portals message, and the Portals-level acknowledgment is crossed out.]
     • The Portals acknowledgment is not required on Cray XT 5 platforms because they use the Basic End-to-End Reliability (BEER) protocol for message transfer (see the sketch below)
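A hypothetical sketch of the optimization, not the actual Open MPI Portals BTL code: when the underlying transport already guarantees delivery, as the slide states SeaStar's BEER protocol does, the sender can skip requesting a transport-level acknowledgment and complete the send locally, saving one network round trip per message. The function names and the boolean flag below are placeholders.

```c
/* Hypothetical illustration of skipping the transport-level ack on a
 * reliable transport. Not Open MPI code; names are placeholders. */
#include <stdbool.h>
#include <stdio.h>

/* Placeholder for a lower-level put operation; `want_ack` stands in for
 * the ack-request flag a real Portals put would carry. */
static void transport_put(const void *buf, size_t len, int dest, bool want_ack)
{
    printf("put %zu bytes to %d, ack %s\n", len, dest,
           want_ack ? "requested" : "not requested");
}

static void btl_send(const void *buf, size_t len, int dest,
                     bool transport_is_reliable)
{
    /* On Cray XT 5, transport_is_reliable would be true (BEER), so no ack
     * is requested and the send completes as soon as the put is issued. */
    transport_put(buf, len, dest, /* want_ack = */ !transport_is_reliable);
}

int main(void)
{
    char msg[64] = {0};
    btl_send(msg, sizeof msg, 1, true);    /* reliable transport: skip the ack */
    btl_send(msg, sizeof msg, 1, false);   /* unreliable transport: request it */
    return 0;
}
```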

  11. Experimental Setup
     • Hardware: Jaguar
       – 18,688 compute nodes
       – 2.6 GHz AMD Opteron (Istanbul) processors
       – SeaStar 2+ routers connected in a 3D torus topology
     • Benchmarks:
       – Point-to-point: OSU latency and bandwidth
       – Collectives: Broadcast in a tight loop; Barrier in a tight loop (see the sketch below)
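A minimal sketch of the kind of tight-loop collective benchmark described above. The warm-up and iteration counts are assumptions for illustration; this is not the benchmark code used in the work.

```c
/* Tight-loop barrier latency measurement: time many back-to-back
 * MPI_Barrier calls and report the mean per-call latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int warmup = 100, iters = 1000;   /* assumed counts, not from the paper */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < warmup; i++)        /* warm up caches and connections */
        MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)         /* the tight loop being timed */
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("mean barrier latency: %.2f usec\n", (t1 - t0) / iters * 1e6);

    MPI_Finalize();
    return 0;
}
```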

  12. 1-Byte Open MPI P2P Latency is 15% Better than Cray MPI
     [Plot: OMPI vs. Cray MPI Portals latency; latency (usec) vs. message size (1 byte to 1 MB) for OMPI with the Portals optimization, OMPI without the Portals optimization, and Cray MPI.]

  13. Open MPI and Cray MPI Bandwidth Saturate at ~2 Gb/s
     [Plot: OMPI vs. Cray MPI Portals bandwidth; bandwidth (Mb/s) vs. message size (1 byte to 10 MB) for OMPI with the Portals optimization, OMPI without the Portals optimization, and Cray MPI.]

  14. Hierarchical Collective Algorithms

  15. Flat Barrier Algorithm
     [Figure: a flat (placement-unaware) barrier over four processes on two hosts completes in two steps of pairwise exchanges; because the algorithm ignores process placement, both steps involve inter-host communication.] A sketch of such a placement-unaware exchange pattern follows below.
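The flat pattern in the figure is consistent with a recursive-doubling exchange; the sketch below shows that pattern with plain MPI point-to-point calls, purely as an illustration to contrast with the hierarchical version on the next slide (it is an assumption about the exact flat algorithm, and it requires a power-of-two process count for simplicity).

```c
/* Flat, placement-unaware barrier via recursive doubling: every process
 * exchanges a token with partners at distance 1, 2, 4, ..., regardless of
 * which host the partner lives on. */
#include <mpi.h>

static void flat_barrier(MPI_Comm comm)
{
    int rank, size, token_out = 0, token_in;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int peer = rank ^ mask;                 /* partner at this step */
        MPI_Sendrecv(&token_out, 1, MPI_INT, peer, 0,
                     &token_in,  1, MPI_INT, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    flat_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```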

  16. Hierarchical Barrier Algorithm
     [Figure: the same four processes on two hosts synchronize in three steps: an intra-host fan-in to a host leader (step 1), an inter-host exchange between the two leaders (step 2), and an intra-host fan-out from the leaders (step 3).] A sketch of this three-step structure follows below.
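A sketch of the three-step hierarchical barrier from the figure, built on the node and leader communicators formed as in the earlier subgrouping sketch. Plain MPI collectives stand in for the fan-in, leader-level, and fan-out primitives; this mirrors the idea, not the Cheetah code.

```c
/* Hierarchical barrier sketch: intra-node fan-in to the node leader,
 * inter-node synchronization among the leaders, intra-node fan-out. */
#include <mpi.h>

static void hierarchical_barrier(MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int token = 0, sum;

    /* Step 1: intra-node fan-in to the node leader (rank 0 in node_comm). */
    MPI_Reduce(&token, &sum, 1, MPI_INT, MPI_SUM, 0, node_comm);

    /* Step 2: the node leaders synchronize among themselves across nodes. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Barrier(leader_comm);

    /* Step 3: intra-node fan-out from the node leader back to all ranks. */
    MPI_Bcast(&token, 1, MPI_INT, 0, node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Same subgrouping as in the earlier sketch: one subgroup per node,
     * plus a leaders-only subgroup for the node leaders. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    hierarchical_barrier(node_comm, leader_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```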

  17. Cheetah's Barrier Collective Outperforms the Cray MPI Barrier by 10%
     [Plot: barrier latency (microseconds) vs. number of MPI processes, up to ~14,000, for Cheetah and Cray MPI.]

  18. Data Flow in a Hierarchical Broadcast Algorithm
     [Figure: data flow between two nodes in a hierarchical broadcast, starting from the broadcast source S.]

  19. Hierarchical Broadcast Algorithms
     • Knownroot Hierarchical Broadcast
       – the suboperations are ordered based on the source of the data
       – the suboperations are started concurrently after the suboperation containing the broadcast source has executed
       – uses a k-nomial tree for data distribution (see the sketch below)
     • N-ary Hierarchical Broadcast
       – same as the Knownroot algorithm, but uses an N-ary tree for data distribution
     • Sequential Hierarchical Broadcast
       – the suboperations are ordered sequentially; there is no concurrent execution
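The Knownroot and N-ary variants above differ only in the tree used to distribute data within a suboperation. Below is a minimal, self-contained sketch of a radix-k (k-nomial) tree broadcast over a single communicator, written with plain MPI point-to-point calls; it illustrates the tree shape only and is not the Cheetah BCOL implementation (the radix of 4 in main is an arbitrary choice).

```c
/* k-nomial tree broadcast sketch: the root's block of ranks is split into
 * k sub-blocks per round, and the block head sends the data to each
 * sub-block head. */
#include <mpi.h>
#include <stdio.h>

static void knomial_bcast(void *buf, int count, MPI_Datatype dt,
                          int root, int k, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;   /* rotate ranks so the root is 0 */

    int span = 1;
    while (span < size) span *= k;             /* smallest power of k >= size */

    int have_data = (vrank == 0);

    while (span > 1) {
        int child_span = span / k;
        if (have_data) {
            /* Head of a block of `span` ranks: hand the data to the heads
             * of its k-1 sub-blocks. */
            for (int j = 1; j < k; j++) {
                int child = vrank + j * child_span;
                if (child < size)
                    MPI_Send(buf, count, dt, (child + root) % size, 0, comm);
            }
        } else if (vrank % child_span == 0 && vrank % span != 0) {
            /* Becomes a sub-block head at this level: receive from the head
             * of the enclosing block. */
            int parent = vrank - (vrank % span);
            MPI_Recv(buf, count, dt, (parent + root) % size, 0, comm,
                     MPI_STATUS_IGNORE);
            have_data = 1;
        }
        span = child_span;
    }
}

int main(int argc, char **argv)
{
    int value = 0, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 42;                                /* root provides the data */
    knomial_bcast(&value, 1, MPI_INT, 0, 4, MPI_COMM_WORLD);  /* radix 4, chosen arbitrarily */
    printf("rank %d received %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```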

  20. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 10% (8 Bytes)
     [Plot: 8-byte broadcast latency (microseconds) vs. number of MPI processes, up to ~25,000, for Cray MPI, Cheetah three-level knownroot k-nomial, Cheetah three-level knownroot n-ary, and Cheetah three-level sequential broadcast.]

  21. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 92% (4 KB)
     [Plot: 4 KB broadcast latency (microseconds) vs. number of MPI processes, up to ~50,000, for Cray MPI, Cheetah three-level knownroot k-nomial, Cheetah three-level knownroot nonblocking n-ary, Cheetah three-level knownroot nonblocking k-nomial, and Cheetah sequential broadcast.]

  22. Cheetah's Broadcast Collective Outperforms the Cray MPI Broadcast by 9% (4 MB)
     [Plot: 4 MB broadcast latency (usec) vs. number of MPI processes, up to ~25,000, for Cray MPI, Cheetah three-level knownroot k-nomial, Cheetah three-level knownroot n-ary, and Cheetah three-level sequential broadcast.]

  23. Summary
     • Cheetah's Broadcast is up to 92% better than Cray MPI's Broadcast
     • Cheetah's Barrier outperforms Cray MPI's Barrier by 10%
     • Open MPI's point-to-point message latency is 15% better than Cray MPI's (1-byte messages)
     • The keys to the performance and scalability of the collective operations:
       – Concurrent execution of suboperations
       – Scalable resource-usage techniques
       – Asynchronous semantics and progress
       – Collective primitives customized for each communication hierarchy
