Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning
Ching-Hsiang Chu¹, Xiaoyi Lu¹, Ammar A. Awan¹, Hari Subramoni¹, Jahanzeb Hashmi¹, Bracy Elton², and Dhabaleswar K. (DK) Panda¹
¹ Department of Computer Science and Engineering, The Ohio State University
² Engility Corporation
Outline
• Introduction
  – Deep Learning on GPU and InfiniBand (IB) Clusters
  – Multi-Source Broadcast-Type Operations for Deep Learning
• Analysis
• Proposed Design
  – Streaming-based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Trends in Modern HPC Architecture
• Multi-core/many-core technologies
• High-performance interconnects: InfiniBand (IB), Omni-Path (< 1 μs latency, 100 Gbps bandwidth)
• Accelerators/coprocessors are becoming common in high-end systems: high compute density, high performance/watt (> 1 Tflop/s DP on a chip)
• High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
[Example systems: K Computer, Tianhe-2, Sunway TaihuLight, Titan]
GPU in HPC Systems
• Growth of GPU clusters in the last three years
  – NVIDIA GPUs boost many Top500 and Green500 systems
  – "Top 13 systems on the latest Green500 are all equipped with the P100 hardware"*
[Chart: count of NVIDIA Fermi-, Kepler-, and Pascal-based systems in the Top500 lists from June 2014 through June 2017]
*Data collected from http://top500.org
Architectures for Deep Learning (DL)
• Past and current trend
  – Multi-core CPUs within a node
  – Multi-core CPUs across nodes (IB networks)
  – Multi-core CPUs + single GPU across nodes (IB networks)
  – Multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
• Near future
  – Multi-core CPUs + multi-GPU across nodes (IB networks)
High-Performance Deep Learning
• Computation using GPUs
• Communication using MPI
  – Exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communication
  Ø E.g., MPI_Bcast from every node (see the sketch below)
• Challenges
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification
[Figure: four GPU nodes exchanging partial gradients with one another]
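To make this communication pattern concrete, the following is a minimal sketch of one minibatch's gradient exchange with CUDA-aware MPI, where each rank in turn broadcasts its partial gradients directly from GPU memory. The function name, buffer layout, and the averaging step are illustrative assumptions, not details taken from the paper.

```c
/* Sketch: multi-source broadcast of partial gradients with CUDA-aware MPI
 * (e.g., MVAPICH2-GDR, which accepts GPU pointers directly). */
#include <mpi.h>
#include <stddef.h>

void exchange_gradients(float *d_local_grad,   /* this rank's gradients (GPU memory)  */
                        float *d_recv_grad,    /* scratch buffer (GPU memory)         */
                        size_t count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Every rank acts as a broadcast source once per minibatch. */
    for (int root = 0; root < size; root++) {
        float *buf = (root == rank) ? d_local_grad : d_recv_grad;
        MPI_Bcast(buf, (int)count, MPI_FLOAT, root, comm);
        /* ... accumulate/average the received gradients on the GPU here ... */
    }
}
```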
Outline
• Introduction
• Analysis
  – Existing Designs
  – Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work
Evaluation Parameters

Notation   Meaning                                                            Unit
n          Number of processes                                                N/A
m          Number of broadcast sources                                        N/A
t_s        Set-up time for sending data                                       sec
t_o(n)     Overhead for issuing an IB-MCAST packet                            sec
M          Original message size                                              bytes
c          Size of a data chunk                                               bytes
MTU        Maximum Transmission Unit for IB-MCAST, provided by the            bytes
           hardware manufacturer
B_Host     Bandwidth of reading host memory                                   bytes/sec
B_GDR      Bandwidth of reading GPU memory via NVIDIA GPUDirect RDMA,         bytes/sec
           where B_Host ≫ B_GDR
B_PCIe     PCIe bandwidth between host and GPU memory                         bytes/sec

[Diagram: message paths and bandwidths among CPU (host memory), GPU memory, PCIe, and the IB HCA]
Ring-Based Broadcast
• Direct: (n - 1) × (t_s + M / B_GDR)
• Staging: M / B_PCIe + (n - 1) × (t_s + M / B_Host)
• Pipeline: (M / c + n - 2) × (t_s + c / B_GDR)
Ø Poor scalability: latency grows linearly with the number of processes
[Figure: ring from Source to Destinations 1-3 through CPU and IB HCA; legend: GDR read, GDR write, network transfer]
K-nomial-Based Broadcast (radix k)
• Direct: log_k(n) × (t_s + M / B_GDR)
• Staging: M / B_PCIe + log_k(n) × (t_s + M / B_Host)
• Pipeline: (M / c) × (t_s + c / B_GDR) × log_k(n)
Ø Non-optimized scalability: latency still grows with log_k(n)
[Figure: k-nomial tree from Source to Destinations 1-3 through CPU and IB HCA; legend: GDR read, GDR write, network transfer]
Hardware Multicast-Based Broadcast*
• For GPU-resident data, using
  – NVIDIA GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overhead
  – IB UD limit: multicast packets are restricted to one MTU
  – GDR limit: low GDR read bandwidth
• Latency: (M / MTU) × (t_s + t_o(n) + MTU / B_GDR) (compared against the ring and k-nomial models in the sketch below)
[Figure: (1) IB gather + GDR read at the source, (2) IB hardware multicast through the switch, (3) IB scatter + GDR write at each destination]
*A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
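As a quick sanity check on how these analytical models scale, the sketch below evaluates the pipelined ring, the direct k-nomial, and the IB-MCAST latencies for a fixed message size while varying the number of processes. All constants (set-up time, bandwidths, MTU, chunk size, and the near-constant per-packet MCAST overhead) are illustrative assumptions, not measured values from the paper.

```c
/* Sketch: compare the analytical broadcast latency models as n grows.
 * Compile with -lm. All parameter values are assumptions for illustration. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double t_s    = 1e-6;               /* set-up time per send (s)             */
    const double t_o    = 1e-6;               /* assumed IB-MCAST per-packet overhead */
    const double B_GDR  = 2e9;                /* GDR read bandwidth (bytes/s)         */
    const double B_PCIe = 12e9;               /* host<->GPU PCIe bandwidth (bytes/s)  */
    const double M      = 4.0 * 1024 * 1024;  /* message size (bytes)                 */
    const double c      = 64.0 * 1024;        /* pipeline chunk size (bytes)          */
    const double MTU    = 4096.0;             /* IB-MCAST packet size (bytes)         */
    const int    k      = 2;                  /* k-nomial radix                       */

    printf("%8s %14s %14s %14s\n", "n", "ring-pipeline", "k-nomial", "IB-MCAST");
    for (int n = 2; n <= 1024; n *= 2) {
        double ring    = (M / c + n - 2) * (t_s + c / B_GDR);
        double knomial = (log(n) / log(k)) * (t_s + M / B_GDR);
        double mcast   = (M / MTU) * (t_s + t_o + MTU / B_GDR);
        printf("%8d %14.6f %14.6f %14.6f\n", n, ring, knomial, mcast);
    }
    (void)B_PCIe;  /* staging variants would also involve the PCIe bandwidth */
    return 0;
}
```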
Problem Statement
• How to leverage IB-MCAST and advanced GPU features such as GDR to design an efficient and scalable broadcast for large messages on GPU clusters?
• How to achieve high overlap and scalability for multi-source broadcast operations?
• How to determine the attainable theoretical and practical performance benefits for deep learning applications?
Outline
• Introduction
• Analysis
• Proposed Design
  – Streaming-based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Overview of the Proposed Streaming Design
• Optimized broadcast send operation
  – Streaming the GPU-resident data through host memory
  – Leveraging InfiniBand hardware multicast
  Ø Low latency: avoids the GDR read limit
  Ø Overlaps data transfers within and across nodes
• Optimized broadcast receive operation
  – Zero-copy scheme leveraging the GDR feature
  Ø Low latency: avoids unnecessary data transfers
Optimized Broadcast Send
• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer
  Ø Fast device-to-host data movement
  – Allocated at the initialization phase
  Ø Low overhead
• Streaming data through host memory (see the sketch below)
  – Fine-tuned chunk size
  – Asynchronous copy operations
  Ø Three-stage pipeline: (1) data preparation (device-to-host copy), (2) IB gather, (3) IB hardware multicast
[Figure: MPI_Bcast(d_out, ...) at the source; chunks flow from GPU (d_out) to the pinned im_buf, then through the IB HCA and IB switch]
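A minimal sketch of the send-side pipeline is shown below: the GPU message is copied chunk by chunk into the pinned staging buffer with cudaMemcpyAsync, and each chunk is handed to the multicast path as soon as its copy completes. The post_ib_mcast() call and the double-buffered layout are hypothetical placeholders standing in for the library's internal IB-MCAST machinery, not a real API.

```c
/* Sketch of the three-stage streaming send: chunked D->H copy, then IB-MCAST.
 * post_ib_mcast() is a hypothetical placeholder for posting an IB multicast send. */
#include <cuda_runtime.h>
#include <stddef.h>

void post_ib_mcast(const void *buf, size_t len);   /* hypothetical */

void streaming_bcast_send(const char *d_out, size_t msg_size,
                          char *im_buf /* pinned, >= 2 * chunk_size */,
                          size_t chunk_size, cudaStream_t stream)
{
    /* Two slots inside im_buf so the copy of chunk i+1 can proceed while the
     * HCA is still transmitting chunk i. A real implementation would also wait
     * for the HCA send completion before reusing a slot. */
    cudaEvent_t done[2];
    cudaEventCreateWithFlags(&done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&done[1], cudaEventDisableTiming);

    size_t nchunks = (msg_size + chunk_size - 1) / chunk_size;
    for (size_t i = 0; i < nchunks; i++) {
        size_t off  = i * chunk_size;
        size_t len  = (off + chunk_size <= msg_size) ? chunk_size : msg_size - off;
        char  *slot = im_buf + (i % 2) * chunk_size;

        /* Stage 1: asynchronous device-to-host copy into the pinned buffer. */
        cudaMemcpyAsync(slot, d_out + off, len, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done[i % 2], stream);

        /* Stages 2-3: once the chunk is in host memory, the HCA gathers it and
         * the switch replicates it via IB hardware multicast. */
        cudaEventSynchronize(done[i % 2]);
        post_ib_mcast(slot, len);
    }
    cudaEventDestroy(done[0]);
    cudaEventDestroy(done[1]);
}
```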
Optimized Broadcast Receive
• Zero-copy broadcast receive (see the sketch below)
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages IB scatter and GDR features
  Ø Low latency
  Ø Frees up PCIe resources for applications
[Figure: MPI_Bcast(d_in, ...) at each destination; the IB switch multicasts to every HCA, which scatters the data directly into GPU memory (d_in) via GDR write]
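On the receive side, the zero-copy scheme amounts to registering the GPU destination buffer with the HCA (via GPUDirect RDMA) and pre-posting it on a UD QP attached to the multicast group, so incoming packets are scattered straight into device memory. The sketch below assumes an already-created protection domain and UD QP, a GDR-capable setup (nv_peer_mem), and glosses over GRH handling and chunk sequencing.

```c
/* Sketch: pre-post a GPU buffer for zero-copy IB-MCAST receive via GDR.
 * Assumes pd/qp are already set up and that GPUDirect RDMA allows
 * ibv_reg_mr() on cudaMalloc'd memory. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int prepost_gdr_recv(struct ibv_pd *pd, struct ibv_qp *qp,
                     const union ibv_gid *mcast_gid, uint16_t mcast_lid,
                     void *d_in, size_t len)
{
    /* Attach the UD QP to the IB multicast group. */
    if (ibv_attach_mcast(qp, mcast_gid, mcast_lid))
        return -1;

    /* Register the GPU-resident destination buffer with the HCA. */
    struct ibv_mr *mr = ibv_reg_mr(pd, d_in, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* Pre-post the receive so the HCA scatters multicast packets directly
     * into GPU memory, with no host staging copy. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)d_in,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr = NULL;
    return ibv_post_recv(qp, &wr, &bad_wr);
}
```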
Overlap Opportunities
• Overlap within a node: asynchronous device-to-host copies proceed while the HCA multicasts previously staged chunks
• Overlap across nodes: broadcasts issued from Node A, Node B, and Node C progress concurrently (see the sketch below)
• Modeled latency: c / B_PCIe + (M / MTU) × (t_s + t_o(n) + MTU / B_Host)
[Timeline figure: GPU/CPU/HCA activity on Nodes A, B, and C for concurrent broadcasts; legend: cudaMemcpyAsync, IB hardware multicast, cudaStreamSynchronize, GDR write]
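At the MPI level, the multi-source overlap can be expressed with non-blocking broadcasts: every rank starts its own broadcast, all of them progress concurrently, and a single wait completes them. This is a generic MPI-3 sketch of the communication pattern (buffer names and sizes are illustrative), not the library-internal implementation.

```c
/* Sketch: overlapping multi-source broadcasts with non-blocking MPI_Ibcast. */
#include <mpi.h>
#include <stdlib.h>

void multi_source_bcast(float *d_bufs[],   /* d_bufs[r]: GPU buffer for root r */
                        size_t count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Request *reqs = malloc(size * sizeof(MPI_Request));

    /* Start one broadcast per source; all of them can overlap. */
    for (int root = 0; root < size; root++)
        MPI_Ibcast(d_bufs[root], (int)count, MPI_FLOAT, root, comm, &reqs[root]);

    /* Computation could be overlapped here before completing the broadcasts. */
    MPI_Waitall(size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```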
Outline
• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
  – OSU Micro-Benchmarks (OMB)
  – Deep Learning Framework
• Conclusion and Future Work
Overview of the MVAPICH2 Project
• High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM), available since 2015
  – Used by more than 2,775 organizations in 85 countries
  – More than 420,000 (> 0.4 million) downloads directly from the OSU site
  – Empowering many TOP500 clusters (June 2017 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
    • 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003; 2,200 processors, 12.25 TFlops) to Sunway TaihuLight (1st in June 2016; 10M cores, 100 PFlops)