  1. S7300: MANAGED COMMUNICATION FOR MULTI-GPU SYSTEMS
  Holger Fröning (1), Benjamin Klenk (1), Hans Eberle (2) & Larry Dennison (2)
  (1) Ruprecht-Karls University of Heidelberg, Germany; (2) NVIDIA Research
  http://www.ziti.uni-heidelberg.de/compeng | holger.froening@ziti.uni-heidelberg.de
  GTC 2017, May 8, 2017

  2. ABOUT US & TODAY
  - Performance and productivity for future and emerging technologies under hard power and energy constraints
  - Rather unusual hardware engineers: sold on BSP styles of computing for data-intensive problems
  - Strong computer engineering background, focus on low-level software layers
  - High-performance analytics & high-performance computing
  - Today's talk: an update on our work on GPU-centric communication

  3. GPU APPLICATIONS
  - "Regular" algorithms: scientific/technical, HPC, machine learning; mostly dense matrix
    - FFT, matrix-matrix multiplication, N-body, convolution, (deep) neural networks, finite-difference codes (PDE solvers)
    - Excellent understanding in the community
  - "Irregular" algorithms: most algorithms outside computational science; organized around pointer-based data structures
    - Data mining, Bayesian inference, compilers, functional interpreters, maxflow, N-body methods (Barnes-Hut, fast multipole), mesh refinement, graphics (ray tracing), event-driven simulation, relational join (databases), ...
  Partly by Keshav Pingali et al., Amorphous Data-parallelism, Technical Report TR-09-05, U. Texas at Austin, 2009
  David Kaeli, How Can GPUs Become First-Class Computing Devices?, William & Mary Computer Science Colloquium, October 26th 2016

  4. NOTE ON DEEP LEARNING
  - Training: 20 EFLOPs @ 10 TFLOP/s = 23 days
  - [Diagram: training dataset is shuffled into mini-batches (data parallelism); each replica runs forward prop and back prop (model parallelism) feeding an optimizer; sequential dependence between iterations]
  Greg Diamos, HPC Opportunities in Deep Learning, Stanford Computer Systems Colloquium, October 5, 2016
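
  As a quick sanity check, the 23-day figure follows directly from the stated numbers (assuming a sustained 10 TFLOP/s, which is my reading of the slide):

  ```latex
  \[
  t = \frac{20\ \mathrm{EFLOP}}{10\ \mathrm{TFLOP/s}}
    = \frac{20\times10^{18}\ \mathrm{FLOP}}{10\times10^{12}\ \mathrm{FLOP/s}}
    = 2\times10^{6}\ \mathrm{s} \approx 23.1\ \mathrm{days}
  \]
  ```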

  5. REMINDER: BULK-SYNCHRONOUS PARALLEL
  - In 1990, Valiant already described GPU computing pretty well
  - Superstep: compute, communicate, synchronize
  - Parallel slackness: v virtual processors, p physical processors
    - v = 1: not viable
    - v = p: unpromising wrt. optimality
    - v >> p: scheduling and pipelining; extremely scalable
  - A GPU is an (almost) perfect BSP processor
  - [Diagram: GPU block with SMs connected via address-sliced XBARs to L2 slices]
  Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33, Issue 8, Aug. 1990

  6. TRANSITIONING TO MULTI-GPU IS FUNDAMENTAL
  - Transition from SMP to NUMA
    - Reasons: multi-GPU systems, multi-chip modules, heterogeneous memory, tiled layout
  - The beauty of BSP is lost
    - Kernel launch orchestration
    - Data movement operations
    - Naming a physical resource is disgusting
  - Compute stack lacks NUMA support: programming models, abstractions, consistency model
  - [Diagram: two GPU blocks, each with SMs, address-sliced XBARs and L2 slices]

  7. ADDRESSING NUMA
  - Analyzing NUMA latency effects
  Read latency [usec], Pascal-class, 2x GPU, PCIe:
  |        | unloaded | loaded   | factor |
  | local  | 0.250    | 0.461    | 1.8    |
  | peer   | 1.311    | 1.378    | 1.1    |
  | host   | 0.838    | 1.004    | 1.2    |
  | factor | ~3.3-5.2 | ~2.2-3.0 |        |
  Bandwidth [GB/s], Pascal-class:
  | local  | 480 |
  | remote | 16  |
  | factor | 30  |
  - Observations on PCIe
    - Huge penalty for local vs. remote accesses
    - Unloaded vs. loaded penalty
    - NVLINK changes the regime
    - Strong and dynamic NUMA effects
  - Publicization/privatization concept => managed communication
    - Examples: MPI, TCP/IP, active messages, various more ...
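
  The talk does not show its measurement code; a plausible way to obtain such local vs. peer read latencies is a single-threaded pointer-chasing kernel over PCIe peer access, sketched below. Kernel name, buffer size, stride and iteration count are illustrative, and error checking is omitted.

  ```cuda
  // Illustrative sketch (not the authors' benchmark): measure dependent-load
  // latency from GPU 0 to local vs. peer (GPU 1) memory via pointer chasing.
  #include <cstdio>
  #include <vector>
  #include <cuda_runtime.h>

  __global__ void chase(const unsigned int *buf, int iters,
                        unsigned long long *cycles, unsigned int *sink) {
      unsigned int idx = 0;
      unsigned long long t0 = clock64();
      for (int i = 0; i < iters; ++i)
          idx = buf[idx];                 // each load depends on the previous one
      unsigned long long t1 = clock64();
      *cycles = (t1 - t0) / iters;        // average GPU cycles per dependent load
      *sink = idx;                        // keep the loads from being optimized away
  }

  int main() {
      const int N = 1 << 20, iters = 10000;

      // Let GPU 0 access GPU 1's memory directly over PCIe (requires peer capability).
      cudaSetDevice(0);
      cudaDeviceEnablePeerAccess(1, 0);

      // Chase buffer on GPU 1 for the "peer" case (allocate on device 0 for "local").
      unsigned int *buf;
      cudaSetDevice(1);
      cudaMalloc(&buf, N * sizeof(unsigned int));

      // Simple strided cycle so every load depends on the previous one.
      std::vector<unsigned int> h(N);
      for (int i = 0; i < N; ++i) h[i] = (i + 64) % N;
      cudaMemcpy(buf, h.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice);

      unsigned long long *cycles; unsigned int *sink;
      cudaSetDevice(0);
      cudaMallocManaged(&cycles, sizeof(*cycles));
      cudaMallocManaged(&sink, sizeof(*sink));

      chase<<<1, 1>>>(buf, iters, cycles, sink);   // single thread: pure latency
      cudaDeviceSynchronize();
      printf("avg GPU cycles per dependent load: %llu\n", *cycles);
      return 0;
  }
  ```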

  8. REST OF THIS TALK
  - Background
  - Understanding massively-parallel communication
  - GPU-centric (but unmanaged) communication
  - Introducing MANTARO
  - Use cases for work execution

  9. BACKGROUND

  10. COMMUNICATION MODELS
  - Plain load/store (LD/ST): de-facto standard in shared-memory systems
    - Never designed for communication
    - Can be fast for SMP, but often unknown costs for NUMA
    - Assumes a perfectly timed load sees the corresponding store
  - Message passing (MP): de-facto standard in HPC
    - Various p2p and collective functions; mainly send/recv semantics used (ease of use)
    - Overhead due to functionality & guarantees: copying, matching, progress, ordering
    - Many more
  - Active messages: latency tolerance becomes a programming/compiling concern
  - One-sided communication (put/get): never say receive
  [Diagrams: shared memory with threads T0/T1 issuing store/load to one memory; message passing between processes P0/P1 with local memories, send(X, 1, tag) matched against recv(Y, 0, tag)]
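
  To make the contrast between two-sided send/recv matching and one-sided put/get concrete, here is a minimal host-side sketch using standard MPI calls; the buffers, count and tag are illustrative and not taken from the talk.

  ```cuda
  // Host-side sketch: two-sided send/recv matching vs. one-sided put/get.
  #include <mpi.h>

  void two_sided(int rank, double *x, double *y, int n) {
      // The receiver must post a matching recv: same communicator, peer rank and tag.
      if (rank == 0)
          MPI_Send(x, n, MPI_DOUBLE, /*dest=*/1, /*tag=*/42, MPI_COMM_WORLD);
      else if (rank == 1)
          MPI_Recv(y, n, MPI_DOUBLE, /*source=*/0, /*tag=*/42, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
  }

  void one_sided(int rank, double *x, double *y, int n) {
      // Expose y as a window; rank 0 writes into it and rank 1 never calls recv.
      MPI_Win win;
      MPI_Win_create(y, n * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);
      MPI_Win_fence(0, win);
      if (rank == 0)
          MPI_Put(x, n, MPI_DOUBLE, /*target_rank=*/1, /*target_disp=*/0,
                  n, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);          // completes the put; data is now visible at rank 1
      MPI_Win_free(&win);
  }
  ```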

  11. GPU COMMUNICATION TODAY
  - Standard: context switch to CPU
    - Limited to coarse-grain communication
    - Kernel-completion boundaries
  - Related work explores CPU helper threads
    - #GPU entities >> #CPU entities
    - Applicability depends on communication pattern
    - [DGCN, dCUDA, ...]
  [Diagram: GPU - CPU - NIC - network - NIC - CPU - GPU over PCIe; the sender finishes its kernel, copies data to the host, and sends a packet; the receiver recvs, copies data to the GPU and starts the next kernel; timeline shows computation, CUDA stack, MPI stack and possible overlap]
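
  The CPU-controlled path in the diagram corresponds roughly to the following sketch; kernel names, launch configuration and message size are placeholders, and error checking is omitted.

  ```cuda
  // Illustrative sketch of the standard CPU-controlled path: kernel completion,
  // device-to-host copy, MPI transfer, host-to-device copy, next kernel launch.
  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void produce(float *buf, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) buf[i] = (float)i;              // placeholder computation
  }

  __global__ void consume(float *buf, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) buf[i] += 1.0f;                 // placeholder use of received data
  }

  void exchange(int rank, int peer, float *d_buf, float *h_buf, int n) {
      int blocks = (n + 255) / 256;
      if (rank == 0) {
          produce<<<blocks, 256>>>(d_buf, n);
          cudaDeviceSynchronize();               // kernel-completion boundary
          cudaMemcpy(h_buf, d_buf, n * sizeof(float),
                     cudaMemcpyDeviceToHost);    // stage through host memory
          MPI_Send(h_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
      } else {
          MPI_Recv(h_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          cudaMemcpy(d_buf, h_buf, n * sizeof(float),
                     cudaMemcpyHostToDevice);
          consume<<<blocks, 256>>>(d_buf, n);    // next kernel can only start now
          cudaDeviceSynchronize();
      }
  }
  ```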

  12. UPSHOT: CPU BYPASS HELPS
  - GPU-to-GPU streaming
  - Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom FPGA network

  13. UNDERSTANDING MASSIVELY-PARALLEL COMMUNICATION
  Do we need fine-grain privatization?

  14. APPROACH
  - Characteristics of massively parallel communication
  - Analyzing large-scale HPC applications
    - DOE Exascale MPI proxy app traces
    - ~1/2 TB analyzed (25+ TB available online)
  | Application             | Pattern           | Ranks            |
  | MOCFE (CESAR)           | Nearest Neighbor  | 64; 256; 1024    |
  | NEKBONE (CESAR)         | Nearest Neighbor  | 64; 256; 1024    |
  | CNS (EXACT)             | Nearest Neighbor  | 64; 256          |
  | CNS Large (EXACT)       | Nearest Neighbor  | 64; 256; 1024    |
  | MultiGrid (EXACT)       | Nearest Neighbor  | 64; 256          |
  | MultiGrid Large (EXACT) | Nearest Neighbor  | 64; 256; 1024    |
  | LULESH (EXMATEX)        | Nearest Neighbor  | 64; 512          |
  | CMC 2D (EXMATEX)        | Nearest Neighbor  | 64; 256; 1024    |
  | AMG (DF)                | Nearest Neighbor  | 216; 1728; 13824 |
  | AMG Boxlib (DF)         | Irregular         | 64; 1728         |
  | BIGFFT (DF)             | Many-to-many      | 100; 1024; 10000 |
  | BIGFFT Medium (DF)      | Many-to-many      | 100; 1024; 10000 |
  | Crystal Router (DF)     | Staged all-to-all | 10; 100          |

  15. APPLICATION CHARACTERISTICS
  - Observations
    - Structured patterns
  [Figure: communication patterns for neighbor, many-to-many, all-to-all and irregular applications]

  16. APPLICATION CHARACTERISTICS
  - Observations
    - Structured patterns
    - Collectives for synchronization, point-to-point for communication

  17. APPLICATION CHARACTERISTICS
  - Observations
    - Structured patterns
    - Collectives for synchronization, point-to-point for communication
    - Most messages are surprisingly small

  18. APPLICATION CHARACTERISTICS
  - Observations
    - Structured patterns
    - Collectives for synchronization, point-to-point for communication
    - Most messages are surprisingly small
    - Few communication peers
  Communication peers as percentage of all ranks:
  | Job size (ranks) | Min   | Median | Max    |
  | [0:63]           | 3.1 % | 28.1 % | 40.6 % |
  | [64:127]         | 6.0 % | 12.0 % | 15.2 % |
  | [128:255]        | 0.6 % | 7.8 %  | 26.4 % |
  | [256:511]        | 3.7 % | 5.4 %  | 7.1 %  |
  | [512:1023]       | 0.4 % | 2.0 %  | 7.0 %  |
  | [1024:2047]      | 1.3 % | 2.0 %  | 4.6 %  |
  | [8192:16383]     | 0.1 % | 0.2 %  | 0.7 %  |

  19. APPLICATION CHARACTERISTICS
  - Observations
    - Structured patterns
    - Collectives for synchronization, point-to-point for communication
    - Most messages are surprisingly small
    - Few communication peers
  - Insights on communication
    - Selective, structured and fine-grained
    - Little/no use of advanced MPI features
    - Irregular applications will further push requirements
  (Application/pattern/ranks table as on slide 14)
  Benjamin Klenk, Holger Fröning, An Overview of MPI Characteristics of Exascale Proxy Applications, International Supercomputing Conference (ISC) 2017 (accepted for publication & best paper finalist)

  20. GPU-CENTRIC (BUT UNMANAGED) COMMUNICATION
  Addressing the need for privatization

  21. GPU-CENTRIC TRAFFIC SOURCING & SINKING
  - GGAS: GPU-centric send/receive
    - Thread-collective data movement
    - Complete CPU bypass
    - Cons: special hardware support required; reduced overlap
  - GRMA: GPU-centric put/get
    - Key is a simple descriptor format
    - Cons: special hardware support required; indirection to issue work
  [Diagram: GPU - CPU - NIC - network - NIC - CPU - GPU over PCIe; the sending GPU stores to the global address space, packets cross the network, the remote side stores to GPU memory; timeline shows computation, CUDA stack, MPI stack and possible overlap]
  Lena Oden and Holger Fröning, GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters, IEEE CLUSTER 2013
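
  GGAS relies on custom hardware that maps remote GPU memory into a global address space, so the real mechanism cannot be reproduced here; the sketch below only illustrates the thread-collective store-and-flag idea behind GPU-centric send/receive. The pointers remote_win and remote_flag stand for hypothetical remote-mapped addresses and are not part of any released API.

  ```cuda
  // Illustrative sketch of thread-collective, GPU-sourced communication.
  // `remote_win`/`remote_flag` are assumed to be mapped (by hypothetical
  // GGAS-like hardware) to the receiving GPU's memory; not the GGAS API itself.
  #include <cuda_runtime.h>

  __global__ void collective_send(const float *src, volatile float *remote_win,
                                  volatile unsigned int *remote_flag, int n) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int nthreads = gridDim.x * blockDim.x;

      // All threads cooperatively store the payload directly into remote memory:
      // the stores themselves are the "send"; no CPU and no NIC descriptor involved.
      for (int i = tid; i < n; i += nthreads)
          remote_win[i] = src[i];

      // Make the payload globally visible before signaling completion.
      __threadfence_system();

      // One thread raises a flag that the receiving GPU polls on.
      if (tid == 0)
          *remote_flag = 1;
  }

  __global__ void collective_recv(volatile unsigned int *local_flag) {
      // Receiver side: spin until the sender's flag store arrives.
      // (A grid-wide barrier would still be needed before all threads read the data.)
      if (blockIdx.x * blockDim.x + threadIdx.x == 0)
          while (*local_flag == 0) { /* spin */ }
  }
  ```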

  22. MICRO-BENCHMARK PERFORMANCE
  - GPU-to-GPU streaming
  - Prototype system consisting of NVIDIA K20c, dual Intel Xeon E5, custom network
  - MPI: CPU-controlled (D2H, MPI send/recv, H2D); all others GPU-controlled, bypassing the CPU
  - Results do not cover overheads regarding issue & completion
