Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters
Akshay Venkatesh, Krishna Kandalla, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
Outline
• Introduction
• Problem Statement
• Designs
• Experimental Evaluation and Analyses
• Conclusion and Future Work
Scientific Applications, Accelerators and MPI
• Several areas such as medical sciences, atmospheric research and earthquake simulations rely on the speed of computation for better prediction/analysis
• Many of these applications benefit from the use of large-scale systems
• Accelerators/coprocessors further increase computational speed and improve energy efficiency
• MPI continues to be widely used as HPC embraces heterogeneous architectures
MVAPICH2/MVAPICH2-X Software
• MPI(+X) continues to be the predominant programming model in HPC
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,055 organizations (HPC centers, industry and universities) in 70 countries
  – More than 181,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 6th-ranked 462,462-core cluster (Stampede) at TACC
    • 19th-ranked 125,980-core cluster (Pleiades) at NASA
    • 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
Latest Version of MVAPICH2
• Support for GPU-aware MPI
• Optimized and tuned point-to-point operations involving GPU buffers
• Support for GPUDirect RDMA
• Optimized GPU collectives
• Ongoing effort to design a high-performance library that enables MPI communication on MIC clusters
Intel Xeon Phi Specifications
• Belongs to the Many Integrated Core (MIC) family
• 61 cores on the chip (each running at ~1 GHz)
• 4 hardware threads per core (smart round-robin scheduler)
• 1 Teraflop peak performance and energy efficient
• x86-compatible; supports OpenMP, MPI, Cilk, etc.
• Installed in the compute node as a PCI Express device
MPI on MIC Clusters (1)
• Stampede has ~6,000 MIC coprocessors
• Tianhe-2 has ~48,000 MIC coprocessors
• Through MPSS*, MIC coprocessors can directly use IB HCAs through peer-to-peer PCI communication for inter-/intra-node communication
• MPI is predominantly used to exploit multiple such compute nodes in tandem
*MPSS – Intel Manycore Platform Software Stack
MPI on MIC Clusters (2)
• MIC supports various modes of operation
  – Offload mode
  – Coprocessor-only or native mode
  – Symmetric mode (see the sketch below)
• Non-uniform host and destination platforms
  – Host to MIC, MIC to host, MIC to MIC
• Transfers involving the MIC incur an additional cost owing to the expensive PCIe path
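As a rough illustration of symmetric mode (a sketch, not taken from the slides): the same source is built once for the host and once for the coprocessor, and both sets of ranks join the same MPI job. Each rank can tell at run time which side it runs on via the Intel compiler's __MIC__ macro, which is defined only in the coprocessor (-mmic) build.

    /* Minimal symmetric-mode sketch: the -mmic build runs natively on the
     * coprocessor, the regular build runs on the host; both join one MPI job. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #ifdef __MIC__
        printf("rank %d runs natively on a Xeon Phi coprocessor\n", rank);
    #else
        printf("rank %d runs on the host processor\n", rank);
    #endif
        MPI_Finalize();
        return 0;
    }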
Symmetric Mode and Implications
• Non-uniform host and destination platforms
  – Host to MIC
  – MIC to host
  – MIC to MIC
• Transfers involving the MIC incur an additional cost owing to the expensive PCIe path
• Performance is non-uniform

Symmetric Mode and Implications
[Figure: point-to-point latency (us, x1000) versus message size (256K–16M bytes) for host->remote_host, host->remote_mic, mic->remote_host and mic->remote_mic transfers]
• MIC->MIC latency = 8x host->host latency
• Host->IB NIC bandwidth = 6x MIC->IB NIC bandwidth
MPI on MIC Clusters (3)
• MIC supports various modes of operation
  – Offload mode
  – Coprocessor-only or native mode
  – Symmetric mode
• Besides point-to-point primitives, the MPI standard also defines a set of collectives such as:
  – MPI_Bcast
  – MPI_Gather
MPI_Gather
[Figure: root process 0 receiving one data block each from processes 1, 2 and 3]
• Gather is used in
  – Multi-agent heuristics
  – Mini-applications
  – Can be used for reduction operations and more
A minimal MPI_Gather call is sketched below.
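For reference, a minimal MPI_Gather call (illustrative only; the use of one MPI_INT per rank is arbitrary): each rank contributes one integer and rank 0, the root, collects them in rank order.

    /* Minimal MPI_Gather usage: every rank sends one int, the root (rank 0)
     * receives size ints in rank order. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int sendval = rank;                        /* each rank contributes its rank id */
        int *recvbuf = NULL;
        if (rank == 0)
            recvbuf = malloc(size * sizeof(int));  /* only the root needs a receive buffer */

        MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("root gathered %d values\n", size);
            free(recvbuf);
        }
        MPI_Finalize();
        return 0;
    }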
MPI Gather (2)
• MPI_Gather
  – One root process receives data from every other process
• On homogeneous systems the collective adopts
  – Linear scheme (sketched over point-to-point calls below)
  – Binomial scheme
  – Hierarchical scheme
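As an illustrative sketch (not the library's actual implementation), the linear scheme can be written directly on top of point-to-point primitives: every non-root rank sends its block to the root, which places each block at the sender's offset. It assumes a contiguous datatype with zero lower bound and that recvbuf is only significant at the root.

    /* Linear gather expressed over p2p calls (sketch only). */
    #include <mpi.h>
    #include <string.h>

    static void linear_gather(const void *sendbuf, int count, MPI_Datatype dtype,
                              void *recvbuf, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Aint lb, extent;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        MPI_Type_get_extent(dtype, &lb, &extent);

        if (rank == root) {
            /* copy the root's own contribution into place */
            memcpy((char *)recvbuf + (size_t)rank * count * extent,
                   sendbuf, (size_t)count * extent);
            /* then receive one block from every other rank */
            for (int src = 0; src < size; src++) {
                if (src == root)
                    continue;
                MPI_Recv((char *)recvbuf + (size_t)src * count * extent,
                         count, dtype, src, 0, comm, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Send(sendbuf, count, dtype, root, 0, comm);
        }
    }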
Typical MPI_Gather on MIC Clusters
[Diagram: four nodes (Node 0–3), each with a host processor, a MIC coprocessor and an IB HCA]
• Yellow grid boxes represent host processors
• Blue grid boxes represent MIC coprocessors
Typical MPI_Gather on MIC Clusters
[Diagram: the same four nodes, with the local leader process marked on each host and MIC]
• Hierarchical scheme or leader-based scheme (see the two-level sketch below)
• Communicator per node
• Leader is the least rank in the node
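A condensed sketch of the leader-based idea (an assumed structure, not MVAPICH2's code): split the communicator into per-node groups with MPI-3's MPI_Comm_split_type, gather within each node at the local leader, then gather the node aggregates at the root. It assumes world rank 0 is the root and a node leader, every node hosts the same number of processes, ranks are numbered consecutively within a node, and the payload is int data with equal counts.

    /* Two-level (leader-based) gather sketch; see the assumptions above. */
    #include <mpi.h>
    #include <stdlib.h>

    void leader_gather(const int *sendbuf, int count, int *recvbuf, MPI_Comm comm)
    {
        int rank, nrank, node_size;
        MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;

        MPI_Comm_rank(comm, &rank);

        /* ranks that share a node land in the same node_comm; its rank 0 is the leader */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &nrank);
        MPI_Comm_size(node_comm, &node_size);

        int *node_buf = NULL;
        if (nrank == 0)
            node_buf = malloc((size_t)node_size * count * sizeof(int));

        /* step 1: intra-node gather at the local leader */
        MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT, 0, node_comm);

        /* step 2: only the leaders participate in the inter-node gather */
        MPI_Comm_split(comm, nrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);
        if (leader_comm != MPI_COMM_NULL) {
            MPI_Gather(node_buf, node_size * count, MPI_INT,
                       recvbuf, node_size * count, MPI_INT, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }

        free(node_buf);
        MPI_Comm_free(&node_comm);
    }

The designs discussed later extend this idea by treating the MIC as an additional level below the host leader.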
Typical MPI_Gather on MIC Clusters
[Diagram: the same four nodes; a MIC leader sends directly through its node's HCA to a remote MIC]
• The local leader on the MIC directly uses the NIC
• IB reads from the MIC are costly
• When the root of the gather is on a MIC, there are transfers from a local MIC to a remote MIC
• This can be very costly
Outline
• Introduction
• Problem Statement
• Designs
• Experimental Evaluation and Analyses
• Conclusion and Future Work
Problem Statement
• What are the primary bottlenecks that affect the performance of the MPI Gather operation on MIC clusters?
• Can we design algorithms to overcome architecture-specific performance deficits and improve gather latency?
• Can we analyze and quantify the benefits of our proposed approach using micro-benchmarks?
MPI Collectives on MIC
• Primitive operations such as MPI_Send, MPI_Recv and their non-blocking counterparts have been optimized*
• MPI collectives such as MPI_Alltoall, MPI_Scatter, etc., which are designed on top of p2p primitives, immediately benefit from such optimizations
• Can we do better?
*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid '13
Outline
• Introduction
• Problem Statement
• Designs
• Experimental Evaluation and Analyses
• Conclusion and Future Work
Design Goals
• Avoid IB reads from the MIC
  – Especially for large transfers
• Use pipelining methods
• Overlap operations when possible
A generic chunked-forwarding sketch of the pipelining and overlap ideas follows.
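The pipelining and overlap goals can be illustrated with a chunked store-and-forward loop (a sketch; the chunk size is a made-up tuning parameter, and the matching sender is assumed to push the payload in the same chunk sizes over tag 0). Instead of moving one monolithic message, the forwarding process keeps the receive of the next chunk in flight while the previous chunk is still being sent on.

    /* Chunked forwarding sketch: the receive of chunk i overlaps with the send
     * of chunk i-1, so the two legs of a transfer are pipelined. */
    #include <mpi.h>

    #define CHUNK (512 * 1024)  /* hypothetical chunk size in bytes */

    void pipeline_forward(char *buf, long total, int src, int dst, MPI_Comm comm)
    {
        long off = 0;
        MPI_Request rreq, sreq = MPI_REQUEST_NULL;

        while (off < total) {
            long len = (total - off < CHUNK) ? (total - off) : CHUNK;

            /* start receiving the next chunk from the source ... */
            MPI_Irecv(buf + off, (int)len, MPI_BYTE, src, 0, comm, &rreq);

            /* ... while the previously received chunk may still be going out */
            if (sreq != MPI_REQUEST_NULL)
                MPI_Wait(&sreq, MPI_STATUS_IGNORE);

            MPI_Wait(&rreq, MPI_STATUS_IGNORE);
            MPI_Isend(buf + off, (int)len, MPI_BYTE, dst, 0, comm, &sreq);
            off += len;
        }
        if (sreq != MPI_REQUEST_NULL)
            MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    }

In practice the chunk size would be tuned to balance pipeline start-up cost against the amount of overlap obtained.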
Designs
• 3-level hierarchical algorithm
• Pipelined algorithm
• Overlapped 3-level hierarchical algorithm
Design1: 3-Level-Hierarchical Algorithm HCA HCA • Step 1: Same as the default hierarchical Host 0 Host 1 or leader based scheme MIC 0 MIC 1 Node 0 Node 1 HCA HCA Host 2 Host 3 Local - Leader MIC 2 MIC 3 Node 2 Node 3 24
Design 1: 3-Level Hierarchical Algorithm
[Diagram: the MIC-side aggregate on each node moving from the MIC to its host over PCIe]
• Step 2: Transfer of the MIC-aggregated data to the host over PCIe
• Difference? SCIF is used, and this path performs comparatively well
Design 1: 3-Level Hierarchical Algorithm
[Diagram: host leaders sending the aggregated node data over IB to the root]
• Advantage: Step 3, which transfers the large aggregate message, does not involve any IB reads from the MIC
• Disadvantage: MIC cores are slow, and hence intra-MIC gathers are slower
• => The host leader needs to wait
A host-leader sketch of the three levels follows.
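A host-leader view of the three levels, as a simplified sketch under stated assumptions: host_comm and leader_comm are pre-built communicators (host processes of this node with the leader at rank 0, and one host leader per node with the root's leader at rank 0), counts are equal int counts per rank, mic_leader is the MIC leader's rank in MPI_COMM_WORLD, and step 2 is shown as a plain MPI receive even though the actual design moves it over SCIF across PCIe.

    /* 3-level hierarchical gather, host-leader side (illustrative sketch).
     *  level 1: gather the host processes of this node
     *  level 2: receive the aggregate the MIC leader gathered from its local ranks
     *  level 3: only host leaders push the large aggregates over IB */
    #include <mpi.h>
    #include <stdlib.h>

    void three_level_gather_host_leader(const int *sendbuf, int count,
                                        int *recvbuf, int nprocs_mic, int mic_leader,
                                        MPI_Comm host_comm, MPI_Comm leader_comm,
                                        MPI_Comm world)
    {
        int host_size;
        MPI_Comm_size(host_comm, &host_size);
        int nprocs_node = host_size + nprocs_mic;

        int *node_buf = malloc((size_t)nprocs_node * count * sizeof(int));

        /* level 1: intra-host gather at the host leader */
        MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT, 0, host_comm);

        /* level 2: MIC aggregate arrives over the PCIe path (SCIF in the real design) */
        MPI_Recv(node_buf + (size_t)host_size * count, nprocs_mic * count, MPI_INT,
                 mic_leader, 0, world, MPI_STATUS_IGNORE);

        /* level 3: inter-node gather among host leaders only, so no IB reads from MIC */
        MPI_Gather(node_buf, nprocs_node * count, MPI_INT,
                   recvbuf, nprocs_node * count, MPI_INT, 0, leader_comm);

        free(node_buf);
    }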
Designs
• 3-level hierarchical algorithm
• Pipelined algorithm
• Overlapped 3-level hierarchical algorithm
Design 2: Pipelined Gather
[Diagram: all processes on a node, host and MIC, sending to the host leader, which forwards to the gather root]
• The leader on each host posts non-blocking recvs from all processes within the node
• The leader sends its own data to the gather root
• Each process within a node sends to the leader on the host
• The host leader forwards to the gather root
Design 2: Pipelined Gather
• Advantage: None of the steps involve IB reads from the MIC
• Disadvantage: The majority of the transfer burden lies on the leaders of the host nodes
• => Processing non-blocking receives on the host leader can turn into a bottleneck
A host-leader sketch of this forwarding loop follows.
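A host-leader view of the pipelined scheme (a sketch; the node_ranks array, the tag-as-origin-rank convention, and the matching root-side placement code are illustrative assumptions): the leader pre-posts one non-blocking receive per node-local process and forwards each block to the root as soon as it completes, so PCIe arrivals and IB sends overlap.

    /* Pipelined gather, host-leader side: pre-post receives from all node-local
     * processes (host and MIC), then forward blocks in completion order. */
    #include <mpi.h>
    #include <stdlib.h>

    void pipelined_gather_leader(const int *sendbuf, int count, int root,
                                 const int *node_ranks, int nlocal, MPI_Comm world)
    {
        int me;
        MPI_Comm_rank(world, &me);

        int *stage = malloc((size_t)nlocal * count * sizeof(int));
        MPI_Request *rreqs = malloc((size_t)nlocal * sizeof(MPI_Request));

        /* one pre-posted non-blocking receive per node-local process */
        for (int i = 0; i < nlocal; i++)
            MPI_Irecv(stage + (size_t)i * count, count, MPI_INT,
                      node_ranks[i], 0, world, &rreqs[i]);

        /* the leader's own contribution goes out immediately; the origin rank is
         * carried in the tag so the root (not shown) can place each block */
        MPI_Send(sendbuf, count, MPI_INT, root, me, world);

        /* forward each block as soon as it has arrived, not in rank order */
        for (int done = 0; done < nlocal; done++) {
            int idx;
            MPI_Waitany(nlocal, rreqs, &idx, MPI_STATUS_IGNORE);
            MPI_Send(stage + (size_t)idx * count, count, MPI_INT,
                     root, node_ranks[idx], world);
        }

        free(rreqs);
        free(stage);
    }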
Designs
• 3-level hierarchical algorithm
• Pipelined algorithm
• Overlapped 3-level hierarchical algorithm
Design 3: 3-Level Hierarchical Overlapped Algorithm
[Diagram: the four nodes, with local leaders marked on the hosts and the MICs]
• Step 1: Same as the default hierarchical or leader-based scheme
Design 3: 3-Level Hierarchical Overlapped Algorithm
• The host leader first posts a non-blocking receive from the MIC leader in the node
• The host leader gathers data from the host processes and sends it to the root
• Meanwhile, the MIC leader starts sending its data to the host leader
A host-leader sketch of this overlap follows.
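A host-leader view of the overlapped scheme (a sketch under assumptions similar to the earlier ones: a pre-built host_comm whose size is nprocs_host, equal int counts, arbitrary tags, and mic_leader given as a rank in MPI_COMM_WORLD): the receive of the MIC aggregate is posted first, so it crosses PCIe while the host-side gather and the first IB send to the root proceed; the MIC portion is forwarded once it has landed.

    /* Overlapped 3-level gather, host-leader side (illustrative sketch). */
    #include <mpi.h>
    #include <stdlib.h>

    void overlapped_gather_host_leader(const int *sendbuf, int count, int root,
                                       int mic_leader, int nprocs_host, int nprocs_mic,
                                       MPI_Comm host_comm, MPI_Comm world)
    {
        int *host_buf = malloc((size_t)nprocs_host * count * sizeof(int));
        int *mic_buf  = malloc((size_t)nprocs_mic  * count * sizeof(int));
        MPI_Request rreq;

        /* post the receive for the MIC aggregate before doing anything else */
        MPI_Irecv(mic_buf, nprocs_mic * count, MPI_INT, mic_leader, 0, world, &rreq);

        /* gather the host processes and ship their data to the root while the
         * MIC aggregate is still crossing the PCIe bus */
        MPI_Gather(sendbuf, count, MPI_INT, host_buf, count, MPI_INT, 0, host_comm);
        MPI_Send(host_buf, nprocs_host * count, MPI_INT, root, 0, world);

        /* complete the overlapped receive and forward the MIC portion */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        MPI_Send(mic_buf, nprocs_mic * count, MPI_INT, root, 1, world);

        free(host_buf);
        free(mic_buf);
    }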