  1. Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters. Akshay Venkatesh, Krishna Kandalla, Dhabaleswar K. Panda. Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

  2. Outline • Introduction • Problem Statement • Designs • Experimental Evaluation and Analyses • Conclusion and Future Work 2

  3. Scientific applications, Accelerators and MPI • Several areas such as medical sciences, atmospheric research and earthquake simulations rely on the speed of computation for better prediction/analysis • Many applications benefit from the use of large-scale systems • Accelerators/coprocessors further increase computational speed and improve energy efficiency • MPI continues to be widely used as HPC embraces heterogeneous architectures 3

  4. MVAPICH2/MVAPICH2-X Software • MPI(+X) continues to be the predominant programming model in HPC • High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002 – MVAPICH2-X (MPI + PGAS), available since 2012 – Used by more than 2,055 organizations (HPC centers, industry and universities) in 70 countries – More than 181,000 downloads directly from the OSU site – Empowering many TOP500 clusters: 6th-ranked 462,462-core cluster (Stampede) at TACC, 19th-ranked 125,980-core cluster (Pleiades) at NASA, 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others – Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Partner in the U.S. NSF-TACC Stampede system 4

  5. Latest Version of MVAPICH2 • Support for GPU-aware MPI • Optimized and tuned point-to-point operations involving GPU buffers • Support for GPUDirect RDMA • Optimized GPU collectives • Ongoing effort to design a high-performance library that enables MPI communication on MIC clusters 5

  6. Intel Xeon Phi Specifications • Belongs to the Many Integrated Core (MIC) family • 61 cores on the chip (each running at 1 GHz) • 4 hardware threads per core (smart round-robin scheduler) • 1 teraflop peak throughput and energy efficient • x86-compatible and supports OpenMP, MPI, Cilk, etc. • Installed in the compute node as a PCI Express device 6

  7. MPI on MIC Clusters (1) • Stampede has ~6,000 MIC coprocessors • Tianhe-2 has ~48,000 MIC coprocessors • Through MPSS*, MIC coprocessors can directly use IB HCAs through peer-to-peer PCI communication for inter- and intra-node communication • MPI is predominantly used to make use of multiple such compute nodes in tandem *MPSS – Intel Manycore Platform Software Stack 7

  8. MPI on MIC Clusters (2) • MIC supports various modes of operation – Offload mode – Coprocessor-only or native mode – Symmetric mode • Non-uniform source and destination platforms – Host to MIC, MIC to host, MIC to MIC • Transfers involving the MIC incur additional cost owing to the expensive PCIe path 8

  9. Symmetric mode and Implications • Non-uniform source and destination platforms – Host to MIC – MIC to host – MIC to MIC • Transfers involving the MIC incur additional cost owing to the expensive PCIe path • Performance is non-uniform 9

  10. Symmetric mode and Implications • [Plot: latency (us, in thousands) vs. message size (256K to 16M bytes) for host->remote_host, host->remote_mic, mic->remote_host and mic->remote_mic transfers] • MIC->MIC latency = 8 x Host->Host latency • Host->IB NIC bandwidth = 6 x MIC->IB NIC bandwidth 10

  11. MPI on MIC Clusters (3) • MIC supports various modes of operation – Offload mode – Coprocessor-only or native mode – Symmetric mode • Besides point-to-point primitives, the MPI standard also defines a set of collectives such as: – MPI_Bcast – MPI_Gather 11

  12. MPI_Gather • [Diagram: root process 0 receives the contributions of processes 0, 1, 2 and 3] • Gather is used in – Multi-agent heuristics – Mini applications – Can be used for reduction operations and more 12

  13. MPI Gather (2) • MPI_Gather – One root process receives data from every other process • On homogeneous systems the collective typically adopts one of the following – Linear scheme – Binomial scheme – Hierarchical scheme • A minimal usage example is sketched below 13
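For reference, a minimal MPI_Gather call in C looks like the sketch below; the element count, datatype and root rank are arbitrary choices for illustration.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;   /* elements contributed per process (arbitrary) */
    const int root  = 0;      /* rank that receives every contribution */

    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = (rank == root) ? malloc((size_t)count * size * sizeof(int)) : NULL;

    for (int i = 0; i < count; i++)
        sendbuf[i] = rank;    /* dummy payload */

    /* Every rank sends 'count' ints; the root receives size * count ints,
     * laid out in rank order in recvbuf. */
    MPI_Gather(sendbuf, count, MPI_INT,
               recvbuf, count, MPI_INT, root, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with mpirun, the root ends up with all contributions concatenated in rank order, which is the semantics every scheme above must preserve.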

  14. Typical MPI_Gather on MIC Clusters • [Diagram: four nodes (Node 0–3), each with an HCA, a host processor (Host 0–3) and a MIC coprocessor (MIC 0–3)] • Yellow grid boxes represent host processors • Blue grid boxes represent MIC coprocessors 14

  15. Typical MPI_Gather on MIC Clusters • Hierarchical scheme or leader-based scheme • One communicator per node • The leader is the lowest rank in the node (local leader) • [Diagram: the four nodes, with the local leader marked in each node] • A sketch of setting up the per-node communicators follows 15
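One standard MPI-3 way to build a per-node communicator and identify the local leader is sketched below. MPI_Comm_split_type with MPI_COMM_TYPE_SHARED is only an illustrative stand-in for whatever internal mechanism the library actually uses (on a symmetric-mode node the host and the MIC may even form separate shared-memory domains), and build_node_comms and its arguments are hypothetical names.

```c
#include <mpi.h>

/* Illustrative setup of the per-node communicator and the local leader.
 * MPI_COMM_TYPE_SHARED groups ranks that share a node's memory; the
 * library's internal mechanism may differ. */
void build_node_comms(MPI_Comm comm,
                      MPI_Comm *node_comm, MPI_Comm *leader_comm,
                      int *is_leader)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Ranks on the same shared-memory node end up in the same communicator;
     * using the global rank as the key preserves the rank order. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, node_comm);

    int node_rank;
    MPI_Comm_rank(*node_comm, &node_rank);
    *is_leader = (node_rank == 0);   /* the lowest rank in the node is the leader */

    /* Leaders form their own communicator for the inter-node phase;
     * non-leaders receive a communicator they simply never use. */
    MPI_Comm_split(comm, *is_leader ? 0 : 1, rank, leader_comm);
}
```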

  16. Typical MPI_Gather on MIC Clusters • The local leader on the MIC directly uses the NIC • IB reading from the MIC is costly • When the root of the gather is on a MIC, there are transfers from a local MIC to a remote MIC • This can be very costly 16

  17. Outline • Introduction • Problem Statement • Designs • Experimental Evaluation and Analyses • Conclusion and Future Work 17

  18. Problem Statement • What are the primary bottlenecks that affect the performance of the MPI Gather operation on MIC clusters? • Can we design algorithms to overcome architecture specific performance deficits to improve gather latency? • Can we analyze and quantify the benefits of our proposed approach using micro-benchmarks? 18
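For context, the micro-benchmarks referred to here are typically OSU-style latency loops; a sketch of such a loop for MPI_Gather is shown below, with the warm-up count, iteration count and message size as placeholders rather than the values used in the actual evaluation.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of an OSU-style MPI_Gather latency loop: time many iterations of
 * the collective and report the average at the root.  The warm-up count,
 * iteration count and message size are placeholders. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int warmup = 10, iters = 100;
    const size_t msg_size = 1 << 20;          /* 1 MB contributed per process */

    char *sendbuf = malloc(msg_size);
    char *recvbuf = (rank == 0) ? malloc(msg_size * size) : NULL;

    double start = 0.0;
    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup) {                    /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
        }
        MPI_Gather(sendbuf, (int)msg_size, MPI_CHAR,
                   recvbuf, (int)msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("avg MPI_Gather latency: %.2f us\n", elapsed * 1e6 / iters);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```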

  19. MPI Collectives on MIC • Primitive operations such as MPI_Send, MPI_Recv and their non-blocking counterparts have been optimized* • MPI collectives such as MPI_Alltoall, MPI_Scatter, etc., which are designed on top of p2p primitives, immediately benefit from such optimizations (see the layering sketch below) • Can we do better? *S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid '13 19
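To make the layering concrete, the sketch below expresses a linear gather purely in terms of point-to-point calls, so any tuning of the underlying p2p path (for example host-to-MIC transfers) is inherited automatically. It is an illustration under simplifying assumptions, not the library's actual code, and linear_gather is a hypothetical helper.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative linear gather built only on point-to-point primitives.
 * Collectives layered like this inherit any p2p-level optimizations.
 * Assumes a contiguous datatype for the memcpy of the root's own data. */
int linear_gather(const void *sendbuf, int count, MPI_Datatype dtype,
                  void *recvbuf, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Non-root ranks just send their contribution to the root. */
    if (rank != root)
        return MPI_Send(sendbuf, count, dtype, root, 0, comm);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(dtype, &lb, &extent);

    /* The root posts one non-blocking receive per peer, directly into the
     * rank-ordered slot of the receive buffer. */
    MPI_Request *reqs = malloc(size * sizeof(MPI_Request));
    for (int src = 0; src < size; src++) {
        char *dst = (char *)recvbuf + (MPI_Aint)src * count * extent;
        if (src == root) {
            memcpy(dst, sendbuf, (size_t)count * extent);
            reqs[src] = MPI_REQUEST_NULL;
        } else {
            MPI_Irecv(dst, count, dtype, src, 0, comm, &reqs[src]);
        }
    }
    MPI_Waitall(size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    return MPI_SUCCESS;
}
```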

  20. Outline • Introduction • Problem Statement • Designs • Experimental Evaluation and Analyses • Conclusion and Future Work 20

  21. Design Goals • Avoid IB reading from the MIC – especially for large transfers • Use pipelining methods • Overlap operations when possible 21

  22. Designs • 3-level hierarchical algorithm • Pipelined Algorithm • Overlapped 3-Level-hierarchical algorithm 22

  23. Designs • 3-level hierarchical algorithm • Pipelined Algorithm • Overlapped 3-Level-hierarchical algorithm 23

  24. Design 1: 3-Level-Hierarchical Algorithm • Step 1: Same as the default hierarchical or leader-based scheme • [Diagram: the four nodes, with the local leaders marked] 24

  25. Design 1: 3-Level-Hierarchical Algorithm • Step 2: Transfer of the MIC-aggregated data to the host over PCIe • Difference? SCIF is used and its performance is relatively competitive 25

  26. Design 1: 3-Level-Hierarchical Algorithm • Advantage: Step 3, which involves transferring the large aggregate message, does not involve any IB reads from the MIC • Disadvantage: MIC cores are slow and hence intra-MIC gathers are slower • => The host leader needs to wait • A sketch of the three steps follows 26
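A rough sketch of the three steps is given below. The communicators (local_comm: the MIC ranks or the host ranks of one node, leader = rank 0; leader_comm: the host leaders of all nodes, final root = rank 0), the on_mic flag and peer_leader_world_rank are assumed to be set up elsewhere, the names are hypothetical, and the reordering of the final buffer into canonical rank order at the root is omitted; this illustrates the step structure only, not MVAPICH2's implementation.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative 3-level hierarchical gather.  Assumed inputs:
 *   local_comm  - MIC ranks (or host ranks) of this node, leader = rank 0
 *   leader_comm - host leaders of all nodes, final gather root = rank 0
 *   on_mic      - nonzero if this process runs on the MIC
 *   peer_leader_world_rank - MPI_COMM_WORLD rank of the other leader on
 *                            this node (host leader for a MIC leader, and
 *                            MIC leader for a host leader)
 * Every process contributes 'count' bytes and all nodes are assumed to
 * have the same layout. */
void gather_3level(const char *sendbuf, int count, char *rootbuf,
                   MPI_Comm local_comm, MPI_Comm leader_comm,
                   int on_mic, int peer_leader_world_rank)
{
    int lrank, lsize;
    MPI_Comm_rank(local_comm, &lrank);
    MPI_Comm_size(local_comm, &lsize);

    /* Step 1: intra-MIC (and intra-host) gather at the local leaders,
     * exactly as in the default leader-based scheme. */
    char *agg = (lrank == 0) ? malloc((size_t)count * lsize) : NULL;
    MPI_Gather(sendbuf, count, MPI_BYTE, agg, count, MPI_BYTE, 0, local_comm);

    if (lrank == 0 && on_mic) {
        /* Step 2: the MIC leader ships its aggregate to the host leader over
         * PCIe (SCIF underneath in the real design); no IB read from the MIC. */
        MPI_Send(agg, count * lsize, MPI_BYTE,
                 peer_leader_world_rank, 0, MPI_COMM_WORLD);
    } else if (lrank == 0 && !on_mic) {
        /* The host leader appends the MIC aggregate to its own host aggregate. */
        MPI_Status st;
        int mic_bytes;
        MPI_Probe(peer_leader_world_rank, 0, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_BYTE, &mic_bytes);

        char *node_agg = malloc((size_t)count * lsize + mic_bytes);
        memcpy(node_agg, agg, (size_t)count * lsize);
        MPI_Recv(node_agg + (size_t)count * lsize, mic_bytes, MPI_BYTE,
                 peer_leader_world_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Step 3: host leaders gather the per-node aggregates over IB; the
         * large inter-node transfers never read from MIC memory. */
        MPI_Gather(node_agg, count * lsize + mic_bytes, MPI_BYTE,
                   rootbuf, count * lsize + mic_bytes, MPI_BYTE, 0, leader_comm);
        free(node_agg);
    }
    free(agg);
}
```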

  27. Designs • 3-level hierarchical algorithm • Pipelined Algorithm • Overlapped 3-Level-hierarchical algorithm 27

  28. Design 2: Pipelined Gather • The leader on each host posts non-blocking receives from all processes within the node • The leader sends its own data to the gather root • Each process within the node sends to the leader on the host • The host leader forwards to the gather root 28

  29. Design 2: Pipelined Gather • Advantage: None of the steps involve IB reading from the MIC • Disadvantage: The majority of the transfer burden lies on the leaders of the host nodes • => Processing the non-blocking receives on the host leader can turn into a bottleneck • The host leader's pipeline is sketched below 29
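The host leader's side of the pipeline might look like the sketch below. node_ranks, the tags, the fixed per-process message size and pipelined_gather_leader itself are hypothetical; the other processes are assumed to send their contribution to the leader with a matching MPI_Send, the leader is assumed not to be the gather root itself, and how the root maps forwarded chunks back to their origin ranks is left out.

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative host-leader side of the pipelined gather.  Assumed inputs:
 *   node_ranks - MPI_COMM_WORLD ranks of the other processes on this node
 *                (host and MIC), each of which sends 'count' bytes to the
 *                leader with tag 0
 *   root       - MPI_COMM_WORLD rank of the gather root (not this process)
 * Chunks are forwarded to the root as soon as they arrive, so forwarding
 * over IB overlaps with the remaining intra-node transfers. */
void pipelined_gather_leader(const char *own_data, int count,
                             const int *node_ranks, int nlocal, int root)
{
    char *chunks = malloc((size_t)count * nlocal);
    MPI_Request *reqs = malloc(nlocal * sizeof(MPI_Request));

    /* Post non-blocking receives from every process on the node. */
    for (int i = 0; i < nlocal; i++)
        MPI_Irecv(chunks + (size_t)i * count, count, MPI_BYTE,
                  node_ranks[i], 0, MPI_COMM_WORLD, &reqs[i]);

    /* The leader's own contribution goes straight to the gather root. */
    MPI_Send(own_data, count, MPI_BYTE, root, 1, MPI_COMM_WORLD);

    /* Forward each chunk as soon as its receive completes (the pipeline). */
    for (int done = 0; done < nlocal; done++) {
        int idx;
        MPI_Waitany(nlocal, reqs, &idx, MPI_STATUS_IGNORE);
        MPI_Send(chunks + (size_t)idx * count, count, MPI_BYTE,
                 root, 1, MPI_COMM_WORLD);
    }

    free(reqs);
    free(chunks);
}
```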

  30. Designs • 3-level hierarchical algorithm • Pipelined Algorithm • Overlapped 3-Level-hierarchical algorithm 30

  31. Design 3: 3-Level-Hierarchical Overlapped Algorithm • Step 1: Same as the default hierarchical or leader-based scheme • [Diagram: the four nodes, with the local leaders marked] 31

  32. Design 3: 3-Level-Hierarchical Overlapped Algorithm • The host leader first posts a non-blocking receive from the MIC leader in the node • The host leader gathers data from the host processes and sends it to the root • Meanwhile, the MIC leader starts sending data to the host leader • The overlap is sketched below 32
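From the host leader's point of view, the overlap can be sketched as below. host_comm (host processes of the node, leader = rank 0), mic_leader, mic_bytes, the tags and the function name are assumptions for the sketch; the MIC leader's side (intra-MIC gather followed by a send of the aggregate to the host leader) is not shown, and the host leader is assumed not to be the gather root.

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative host-leader side of the overlapped 3-level gather.
 * Assumed inputs:
 *   host_comm  - host processes of this node, leader = rank 0 (this process)
 *   mic_leader - MPI_COMM_WORLD rank of the node's MIC leader
 *   mic_bytes  - size of the MIC-aggregated message
 *   root       - MPI_COMM_WORLD rank of the gather root (not this process)
 * Posting the receive for the MIC aggregate first lets the PCIe transfer
 * overlap with the host-side gather and the IB send of the host data. */
void overlapped_gather_host_leader(const char *own_data, int count,
                                   MPI_Comm host_comm,
                                   int mic_leader, int mic_bytes, int root)
{
    int hsize;
    MPI_Comm_size(host_comm, &hsize);

    /* Post the non-blocking receive for the MIC aggregate first. */
    char *mic_agg = malloc(mic_bytes);
    MPI_Request mic_req;
    MPI_Irecv(mic_agg, mic_bytes, MPI_BYTE, mic_leader, 0,
              MPI_COMM_WORLD, &mic_req);

    /* Gather the host processes' data and forward it to the gather root;
     * meanwhile the MIC leader aggregates and pushes its data over PCIe. */
    char *host_agg = malloc((size_t)count * hsize);
    MPI_Gather(own_data, count, MPI_BYTE, host_agg, count, MPI_BYTE,
               0, host_comm);
    MPI_Send(host_agg, count * hsize, MPI_BYTE, root, 1, MPI_COMM_WORLD);

    /* Once the overlapped PCIe transfer completes, forward the MIC data too. */
    MPI_Wait(&mic_req, MPI_STATUS_IGNORE);
    MPI_Send(mic_agg, mic_bytes, MPI_BYTE, root, 1, MPI_COMM_WORLD);

    free(host_agg);
    free(mic_agg);
}
```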
