Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems


  1. Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems. Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda. Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

  2. Outline • Introduction • Proposed Designs • Performance Evaluation • Conclusion

  3. Drivers of Modern HPC Cluster Architectures • Multi-core processors are ubiquitous • InfiniBand is very popular in HPC clusters • Accelerators/Coprocessors becoming common in high-end systems • Pushing the envelope for Exascale computing [Figure: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 us latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop/s DP on a chip); example systems: Tianhe-2, Stampede, Titan, Tianhe-1A]

  4. Accelerators in HPC Systems • Growth of accelerator-enabled clusters in the last 3 years – 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov'15 – From Top500 list (http://www.top500.org) [Figure: system counts of accelerator-enabled clusters from June 2013 through Nov 2015, broken down by NVIDIA Kepler, NVIDIA Fermi, and Intel Xeon Phi]

  5. Motivation • Parallel applications on GPU clusters – CUDA (Compute Unified Device Architecture): • Kernel computation on NVIDIA GPUs – CUDA-Aware MPI (Message Passing Interface): • Communications across processes/nodes • Non-blocking communication to overlap with CUDA kernels: MPI_Isend(Buf1, ..., request1); MPI_Isend(Buf2, ..., request2); /* Independent computations on CPU/GPU */ MPI_Wait(request1, status1); MPI_Wait(request2, status2);
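A minimal sketch of this overlap pattern, assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) that accepts device pointers; independent_work, overlap_example, d_buf1, d_buf2, d_work and peer are illustrative names, not from the slides:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Stand-in for the application's independent GPU computation. */
    __global__ void independent_work(double *d_x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_x[i] *= 2.0;
    }

    void overlap_example(double *d_buf1, double *d_buf2, double *d_work,
                         int count, int peer, cudaStream_t stream)
    {
        MPI_Request req[2];

        /* CUDA-aware MPI: device buffers are passed directly to MPI. */
        MPI_Isend(d_buf1, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(d_buf2, count, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[1]);

        /* Independent GPU (and possibly CPU) work overlaps with the sends. */
        independent_work<<<(count + 255) / 256, 256, 0, stream>>>(d_work, count);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        cudaStreamSynchronize(stream);
    }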

  6. Motivation • Use of non-contiguous data is becoming common – MPI Datatypes make it easy to represent complex data structures – E.g., fluid dynamics, image processing… • What if the data are in GPU memory? 1. Copy the data to the CPU to perform the packing/unpacking • Slower for large messages • Data movements between GPU and CPU are expensive 2. Utilize a GPU kernel to perform the packing/unpacking* • No explicit copies, faster for large messages *R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept 2014, pp. 221–230.
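As an illustration of describing non-contiguous GPU-resident data with an MPI Datatype, the sketch below sends one column of a row-major n x n matrix using MPI_Type_vector. It assumes a CUDA-aware MPI library that handles the device pointer (packing it with a GPU kernel, as in option 2 above); send_column, d_matrix, n, col and peer are illustrative names:

    #include <mpi.h>

    /* One column of an n x n row-major matrix: n blocks of 1 element,
     * each block separated by a stride of n elements. */
    void send_column(double *d_matrix, int n, int col, int peer)
    {
        MPI_Datatype column_t;

        MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        /* d_matrix is a device pointer; a CUDA-aware MPI packs the
         * strided elements before (or while) sending them. */
        MPI_Send(d_matrix + col, 1, column_t, peer, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column_t);
    }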

  7. Motivation – Non-Contiguous Data Movement in MPI • Common scenario: MPI_Isend(Buf1, ..., req1); MPI_Isend(Buf2, ..., req2); /* application work on the CPU/GPU */ MPI_Waitall(req, …); where Buf1, Buf2… contain non-contiguous MPI Datatypes • Waste of computing resources on CPU and GPU [Figure: timeline of the common scenario]

  8. Problem Statement • Low overlap between CPU and GPU for applications – Packing/unpacking operations are serialized • CPU/GPU resources are not fully utilized – GPU threads remain idle for most of the time – Low utilization, low efficiency • Can we have designs that leverage new GPU technology to address these issues? [Figure: radar chart comparing the User Naive, User Advanced, and Proposed approaches along Performance, Overlap, Productivity, and Resource Utilization axes; farther from the center is better]

  9. Goals of this work • Propose new designs that leverage new NVIDIA GPU technologies – Hyper-Q technology (multi-streaming) – CUDA Event and Callback • Achieve – High performance and resource utilization for applications – High productivity for developers

  10. Outline • Introduction • Proposed Designs – Event-based – Callback-based • Performance Evaluation • Conclusion

  11. Overview [Figure: timelines of the existing design vs. the proposed design. Existing design: for each Isend, the CPU initiates the packing kernel, waits for the kernel (WFK), then starts the send, and only then progresses the next Isend. Proposed design: the CPU initiates the packing kernels for all Isends back to back and progresses each send as its kernel on a stream finishes, so packing and communication overlap and the proposed design finishes earlier than the existing one.]

  12. Event-based Design • CUDA Event Management – Provides a mechanism to signal when tasks have occurred in a CUDA stream • Basic design idea 1. CPU launches a CUDA packing/unpacking kernel 2. CPU creates a CUDA event and then returns immediately • GPU sets the event status to 'completed' when the kernel completes 3. In MPI_Wait/MPI_Waitall: • CPU queries the events when the packed/unpacked data is required for communication

  13. Event-based Design [Figure: sequence diagram across CPU, GPU, and HCA. For each of three MPI_Isend() calls, the CPU launches pack_kernelN<<< >>> on the GPU and calls cudaEventRecord(). Inside MPI_Waitall(), the CPU queries and progresses the events; as each packing kernel completes, the send is issued through the HCA and the corresponding request completes.]
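A minimal sketch of the event-based mechanism, not the actual MVAPICH2-GDR implementation: the send path launches the packing kernel, records a CUDA event, and returns immediately, while the wait/progress path polls the event with cudaEventQuery() and hands the packed buffer to the network only once packing has finished. gpu_pack_request, isend_start, isend_progress and launch_pack_kernel are hypothetical names used only for this sketch:

    #include <cuda_runtime.h>

    /* Hypothetical per-request state kept by the MPI library. */
    typedef struct {
        cudaEvent_t pack_done;   /* recorded right after the packing kernel */
    } gpu_pack_request;

    /* Isend path: launch the packing kernel, record an event, return at once. */
    void isend_start(gpu_pack_request *req, cudaStream_t stream,
                     void (*launch_pack_kernel)(cudaStream_t))
    {
        launch_pack_kernel(stream);                      /* pack on the GPU */
        cudaEventCreateWithFlags(&req->pack_done, cudaEventDisableTiming);
        cudaEventRecord(req->pack_done, stream);
        /* control returns to the application immediately */
    }

    /* Wait path: called repeatedly from MPI_Wait/MPI_Waitall; the CPU keeps
     * polling the event and issues the network send once packing is done. */
    int isend_progress(gpu_pack_request *req)
    {
        if (cudaEventQuery(req->pack_done) == cudaSuccess) {
            /* ... hand the packed buffer to the HCA / network layer ... */
            cudaEventDestroy(req->pack_done);
            return 1;   /* request can now complete */
        }
        return 0;       /* still packing; keep polling (costs CPU cycles) */
    }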

  14. Event-based Design • Major benefits – Overlap between CPU communication and GPU packing kernel – GPU resources are highly utilized • Limitation – CPU is required to keep checking the status of the event • Lower CPU utilization MPI_Isend(Buf1, ..., request1); MPI_Isend(Buf2, ..., request2); MPI_Wait(request1, status1); MPI_Wait(request2, status2);

  15. Callback-based Design • CUDA Stream Callback – Launches work automatically on the CPU when preceding tasks on the CUDA stream have completed – Restrictions: • Callbacks are processed by a driver thread, from which no CUDA APIs can be called • Overhead when initializing the callback function • Basic design idea 1. CPU launches a CUDA packing/unpacking kernel 2. CPU adds a callback function and then returns immediately 3. The callback function wakes up a helper thread to process the communication

  16. Callback-based Design [Figure: sequence diagram across CPU (main, callback, and helper threads), GPU, and HCA. For each of three MPI_Isend() calls, the main thread launches pack_kernelN<<< >>> on the GPU and calls addCallback(), then continues with CPU computations. As each packing kernel completes, its callback fires and the helper thread issues the send through the HCA; the requests complete by the time MPI_Waitall() returns.]
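A minimal sketch of the callback-based mechanism, again not the MVAPICH2-GDR internals: the send path launches the packing kernel and registers a handler with cudaStreamAddCallback(); the handler runs on a CUDA driver thread, makes no CUDA or MPI calls, and only wakes a helper thread (assumed to have been created at library initialization), which then drives the communication. send_state, on_pack_done, helper_thread, isend_start and launch_pack_kernel are hypothetical names:

    #include <cuda_runtime.h>
    #include <pthread.h>

    /* Hypothetical bookkeeping shared by the CUDA callback and the helper. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             packed;    /* set once the packing kernel is done */
    } send_state;

    /* Runs on a CUDA driver thread after the packing kernel finishes.
     * It makes no CUDA (or MPI) calls; it only signals the helper thread. */
    static void CUDART_CB on_pack_done(cudaStream_t stream, cudaError_t status,
                                       void *user_data)
    {
        send_state *s = (send_state *)user_data;
        pthread_mutex_lock(&s->lock);
        s->packed = 1;
        pthread_cond_signal(&s->ready);
        pthread_mutex_unlock(&s->lock);
    }

    /* Helper thread: blocks until the callback fires, then drives the send,
     * so the main thread stays free for application work. */
    static void *helper_thread(void *arg)
    {
        send_state *s = (send_state *)arg;
        pthread_mutex_lock(&s->lock);
        while (!s->packed)
            pthread_cond_wait(&s->ready, &s->lock);
        pthread_mutex_unlock(&s->lock);
        /* ... issue the network send for the packed buffer here ... */
        return NULL;
    }

    /* Isend path: launch the packing kernel, register the callback, return. */
    void isend_start(send_state *s, cudaStream_t stream,
                     void (*launch_pack_kernel)(cudaStream_t))
    {
        launch_pack_kernel(stream);                      /* pack on the GPU */
        cudaStreamAddCallback(stream, on_pack_done, s, 0);
    }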

  17. Callback-based Design • Major benefits – Overlap between CPU communication and GPU packing kernel – Overlap between CPU communication and other computations – Higher CPU and GPU utilization MPI_Isend(Buf1, ..., &requests[0]); MPI_Isend(Buf2, ..., &requests[1]); MPI_Isend(Buf3, ..., &requests[2]); // Application work on the CPU MPI_Waitall(requests, status);

  18. Outline • Introduction • Proposed Designs • Performance Evaluation – Benchmark – HaloExchange-based Application Kernel • Conclusion

  19. Overview of the MVAPICH2 Project • High-performance open-source MPI library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002 – MVAPICH2-X (MPI + PGAS), available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014 – Support for Virtualization (MVAPICH2-Virt), available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), available since 2015 – Used by more than 2,575 organizations in 80 countries – More than 376,000 (0.37 million) downloads from the OSU site directly – Empowering many TOP500 clusters (Nov '15 ranking): • 10th-ranked 519,640-core cluster (Stampede) at TACC • 13th-ranked 185,344-core cluster (Pleiades) at NASA • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others – Available with software stacks of many vendors and Linux distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Empowering Top500 systems for over a decade – from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

  20. Experimental Environments 1. Wilkes cluster @ University of Cambridge – 2 NVIDIA K20c GPUs per node • Up to 32 GPU nodes 2. CSCS cluster @ Swiss National Supercomputing Centre – Cray CS-Storm system – 8 NVIDIA K80 GPUs per node • Up to 96 GPUs over 12 nodes
