

  1. GPUnet: networking abstractions for GPU programs
     Mark Silberstein, Amir Wated (Technion – Israel Institute of Technology)
     Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel (University of Texas at Austin)

  2. What: a socket API for programs running on the GPU.
     Why: GPU-accelerated servers are hard to build.
     Results, GPU vs. CPU: 50% higher throughput, 60% lower latency, half the lines of code.

  3. Motivation: GPU-accelerated networking applications.
     [Diagram: data processing servers and a MapReduce cluster, each built from multiple GPUs]

  4. Recent GPU-accelerated networking applications: SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014), ...

  5. ... and every one of these applications required heroic engineering efforts.

  6. GPU-accelerated networking apps, recurring themes:
     ● NIC-GPU interaction
     ● Pipelining and buffer management
     ● Request batching

  7. GPU-accelerated networking apps, recurring themes:
     ● NIC-GPU interaction
     ● CPU-GPU-NIC pipelining
     ● Request batching
     We will sidestep these problems.

  8. The real problem: the CPU is the only boss.
     [Diagram: the CPU mediates all interaction between the NIC, storage, and the GPU]

  9. Example: a CPU server.
     [Diagram: the CPU loops over recv(), compute(), send(), with data moving between the NIC and CPU memory]
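
As a point of reference, the slide's CPU server is the classic blocking loop below; a minimal sketch in which compute() stands in for the application logic and socket setup is elided:

```cuda
#include <sys/types.h>
#include <sys/socket.h>

// Stand-in for the application's processing step (assumed).
void compute(char *buf, ssize_t n) { (void)buf; (void)n; }

// Classic CPU server loop: everything stays in CPU memory.
void serve(int fd) {                                   // fd: an already-accepted connection
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {  // request lands in CPU memory
        compute(buf, n);                               // process on the CPU
        send(fd, buf, n, 0);                           // reply from the same buffer
    }
}
```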

  10. Inside a GPU-accelerated server.
      [Diagram: CPU and GPU, each with its own memory, connected to the NIC over the PCIe bus]
      In theory: recv(), GPU_compute(), send().

  11. Inside a GPU-accelerated server, in practice: the CPU must recv() and batch() incoming requests.

  12. ...then optimize() the batch and transfer() it to GPU memory.

  13. ...then balance() the load and invoke() the GPU_compute() kernels on the GPU.

  14. ...then transfer() the results back and cleanup().

  15. ...then dispatch() and send() the replies.

  16. Inside a GPU-accelerated server: aggressive pipelining via double buffering, asynchrony, and multithreading.
      [Diagram: many copies of the recv/batch/optimize/transfer/balance/GPU_compute/transfer/cleanup/dispatch/send sequence in flight at once, overlapped across requests]

  17. This code is for a CPU to manage a GPU.
      [Diagram: the same interleaved pipeline calls from the previous slide]
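
To ground slide 17, here is roughly what that CPU-side management code looks like with standard CUDA streams and double buffering. recv_batch()/send_batch() and the process kernel are hypothetical stand-ins; only the CUDA calls are real API:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

#define NBUF  2                 // double buffering
#define BATCH (64 * 1024)

// Toy stand-in for GPU_compute(); the real kernel is application-specific.
__global__ void process(char *d, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] ^= 0x20;
}

// Hypothetical stand-ins for the CPU-side recv()/batch() and dispatch()/send() steps.
static size_t recv_batch(char *dst, size_t cap) { (void)dst; return cap; }
static void   send_batch(const char *src, size_t n) { (void)src; (void)n; }

int main() {
    char *host[NBUF], *dev[NBUF];
    size_t len[NBUF] = {0};
    cudaStream_t stream[NBUF];
    for (int i = 0; i < NBUF; i++) {
        cudaMallocHost(&host[i], BATCH);      // pinned memory, required for async DMA
        cudaMalloc(&dev[i], BATCH);
        cudaStreamCreate(&stream[i]);
    }
    for (long iter = 0; ; iter++) {
        int i = (int)(iter % NBUF);
        cudaStreamSynchronize(stream[i]);     // wait for the round in flight on buffer i
        if (iter >= NBUF)
            send_batch(host[i], len[i]);      // dispatch(); send(); the finished batch
        len[i] = recv_batch(host[i], BATCH);  // recv(); batch();
        cudaMemcpyAsync(dev[i], host[i], len[i],
                        cudaMemcpyHostToDevice, stream[i]);           // transfer() in
        process<<<(BATCH + 255) / 256, 256, 0, stream[i]>>>(dev[i], len[i]);
        cudaMemcpyAsync(host[i], dev[i], len[i],
                        cudaMemcpyDeviceToHost, stream[i]);           // transfer() out
    }                                         // the other buffer overlaps with this one
}
```

Every line of pipelining logic here exists only to feed the GPU, which is exactly the code GPUnet sets out to eliminate.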

  18. GPUs are not co-processors, they are peer processors, and they need I/O abstractions:
      ● File system I/O: GPUfs [ASPLOS 2013]
      ● Network I/O: this work

  19. GPUnet: a socket API for GPUs. The application view:
      ● GPU-native server on node0.technion.ac.il: socket(AF_INET, SOCK_STREAM); listen(:2340);
      ● GPU-native client, across the network: socket(AF_INET, SOCK_STREAM); connect("node0:2340");
      ● CPU client, likewise: socket(AF_INET, SOCK_STREAM); connect("node0:2340");
      GPUnet implements the GPU-side sockets on both the server and client nodes.
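
In code, the application view might look like the sketch below. The g*-prefixed in-kernel calls mirror the CPU socket API as the slide describes, but their exact names and signatures here are assumptions, not the library's verbatim interface:

```cuda
// Hypothetical GPU-native echo server for slide 19's picture. One
// threadblock executes this cooperatively; gsocket/gbind/glisten/
// gaccept/grecv/gsend/gclose are assumed GPUnet-style calls.
__global__ void gpu_server() {
    __shared__ char buf[4096];
    int s = gsocket(AF_INET, SOCK_STREAM);
    gbind(s, 2340);                          // listen(:2340) on node0
    glisten(s);
    for (;;) {
        int c = gaccept(s);                  // a CPU or GPU client connects
        int n;
        while ((n = grecv(c, buf, sizeof(buf))) > 0)
            gsend(c, buf, n);                // reply without ever leaving the GPU
        gclose(c);
    }
}

// A GPU-native client is symmetric: gsocket(); gconnect("node0:2340");
// a CPU client uses the ordinary socket(); connect("node0:2340").
```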

  20. A GPU-accelerated server with GPUnet: the CPU is not involved.
      [Diagram: recv(), GPU_compute(), send() all run on the GPU; the NIC reaches GPU memory directly over the PCIe bus]

  21. The same picture with the CPU removed entirely: the GPU and its memory talk to the NIC over PCIe.

  22. A GPU-accelerated server with GPUnet: no request batching.
      [Diagram: many independent recv() / GPU_compute() / send() pipelines running concurrently on the GPU]

  23. A GPU-accelerated server with GPUnet: automatic request pipelining and automatic buffer management.
      [Diagram: the same concurrent pipelines, with GPUnet managing the network buffers]
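
Slides 22-23 follow from the execution model: launch many threadblocks, give each its own connection, and let each run the naive sequential loop; the pipelining and buffer reuse that slide 16's CPU code managed by hand then emerge from blocks overlapping one another. A sketch, reusing the assumed g* calls (get_conn() is a hypothetical per-block connection handoff):

```cuda
// Each threadblock runs one independent recv/compute/send pipeline.
__global__ void gpu_server_blocks() {
    __shared__ char buf[4096];
    int c = get_conn(blockIdx.x);                    // hypothetical: this block's connection
    int n;
    while ((n = grecv(c, buf, sizeof(buf))) > 0) {   // blocks only this threadblock
        gpu_compute(buf, n);                         // all threads of the block cooperate
        gsend(c, buf, n);
    }
}
// Launched as, e.g., gpu_server_blocks<<<28, 256>>>(): while some blocks
// wait on the NIC, others compute, so no explicit batching is needed.
```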

  24. Building a socket abstraction for GPUs.

  25. Goals:
      ● Simplicity: a reliable streaming abstraction for GPUs
      ● Performance: NIC → GPU data path optimizations
      [Diagram: recv() on the GPU, with data flowing from the NIC toward GPU memory across the PCIe bus]

  26. Design option 1: transport layer processing on the CPU.
      Network buffers live in CPU memory, while the GPU controls the flow of data (it issues recv()).

  27. Design option 1, the drawback: extra CPU-GPU memory transfers, since every payload is staged in CPU memory before reaching the GPU.

  28. Design option 2: transport layer processing on the GPU.
      Network buffers live in GPU memory, and the NIC reads and writes them directly via peer-to-peer (P2P) DMA.

  29. Design option 2, the drawbacks: running TCP/IP on a GPU? And would CPU applications then have to access the network through the GPU?

  30. Not the CPU, not the GPU: we need help from the NIC hardware.

  31. RDMA: offloading transport layer processing to the NIC.
      [Diagram: streaming message buffers on both the CPU and the GPU; the RDMA NIC moves data between them reliably]
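
A common way to turn RDMA's reliable messages into the reliable byte stream a socket layer needs, and roughly the shape such designs take, is a ring buffer in GPU memory whose producer index is advanced by NIC writes. This consumer-side sketch is a generic illustration, not GPUnet's actual data structures:

```cuda
// Receive ring living in GPU memory. The peer RDMA-writes payload bytes
// into data[] and then RDMA-writes the new head; the GPU-side recv path
// consumes from tail. Names and layout are illustrative.
struct rx_ring {
    char            data[1 << 20];     // stream bytes, written by the NIC
    volatile size_t head;              // producer index, RDMA-written by the peer
    size_t          tail;              // consumer index, advanced by the GPU
};

__device__ int ring_recv(struct rx_ring *r, char *dst, int len) {
    __shared__ int n;
    if (threadIdx.x == 0) {
        while (r->head == r->tail)
            ;                          // spin until the NIC delivers more bytes
        size_t avail = r->head - r->tail;
        n = (int)(avail < (size_t)len ? avail : (size_t)len);
    }
    __syncthreads();                   // all threads agree on n
    for (int i = threadIdx.x; i < n; i += blockDim.x)   // cooperative copy
        dst[i] = r->data[(r->tail + i) % sizeof(r->data)];
    __syncthreads();
    if (threadIdx.x == 0)
        r->tail += n;                  // freeing space returns credit to the sender
    __syncthreads();
    return n;
}
```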

  32. GPUnet layers, top to bottom:
      ● GPU socket API
      ● Reliable in-order streaming
      ● Reliable channel
      ● Transports: RDMA (Infiniband) and non-RDMA (UNIX domain sockets, TCP/IP)

  33. The same layers, annotated: the socket API and streaming layers run on the GPU, which buys simplicity; the reliable channel is offloaded to the NIC for RDMA transports (with the CPU serving non-RDMA transports), which buys performance.

  34. See the paper for:
      ● Coalesced API calls
      ● Latency-optimized GPU-CPU flow control
      ● Memory management
      ● Bounce buffers
      ● Non-RDMA support
      ● GPU performance optimizations

  35. Implementation:
      ● Standard API calls, blocking/non-blocking
      ● libGPUnet.a: AF_INET, streaming over Infiniband RDMA; fully compatible with the CPU rsocket library
      ● libUNIXnet.a: AF_LOCAL, UNIX domain socket support for inter-GPU and CPU-GPU communication
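
rsocket compatibility means an unmodified CPU client can reach a GPUnet server simply by using librdmacm's r-prefixed socket calls; a minimal sketch, where the host name and port follow slide 19 and are otherwise placeholders:

```cuda
#include <rdma/rsocket.h>   // librdmacm: rsocket(), rconnect(), rsend(), rrecv(), rclose()
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>

// CPU client: identical to a TCP client except for the r* calls.
int query(const char *req, char *reply, size_t cap) {
    struct addrinfo *ai;
    if (getaddrinfo("node0", "2340", NULL, &ai))      // server from slide 19
        return -1;
    int fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
    if (fd < 0 || rconnect(fd, ai->ai_addr, ai->ai_addrlen) < 0)
        return -1;
    rsend(fd, req, strlen(req), 0);                   // request goes out over RDMA
    int n = (int)rrecv(fd, reply, cap, 0);            // reply produced by GPU code
    rclose(fd);
    freeaddrinfo(ai);
    return n;
}
```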

  36. Implementation, the big picture:
      [Diagram: on the GPU, the application sits on the GPUnet socket library, with network buffers and flow control in GPU memory; on the CPU, a GPUnet proxy with bounce buffers in CPU memory serves as a fallback path to the NIC]

  37. Evaluation:
      ● Analysis of GPU-native server design
      ● Matrix product server
      ● In-GPU-memory MapReduce
      ● Face verification server
      Testbed: 2x 6-core Intel E5-2620 CPUs, NVIDIA Tesla K20Xm GPU, Mellanox Connect-IB HCA, Switch-X bridge.

  38. In-GPU-memory MapReduce.
      [Diagram: on each GPU, input arrives through GPUfs, Map runs locally, intermediate data is shuffled between GPUs through GPUnet to a Receiver stage, then Sort and Reduce run on the GPU]
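
The shuffle between Map and the Receiver stage is where GPUnet enters: mapper blocks push their partitions to the target GPU over sockets. A purely illustrative sketch, with the g* calls and the kv_pair layout carried over as assumptions from the earlier sketches:

```cuda
struct kv_pair { unsigned key; unsigned val; };

// Sender side: a mapper block ships the pairs destined for one peer GPU.
__device__ void shuffle_out(int sock, const struct kv_pair *p, int n) {
    gsend(sock, p, n * sizeof(*p));
}

// Receiver side: drain incoming pairs until the buffer fills or peers
// finish, leaving them in GPU memory for the Sort and Reduce stages.
__device__ int shuffle_in(int sock, struct kv_pair *buf, int cap) {
    int total = 0, n;
    while (total < cap &&
           (n = grecv(sock, buf + total, (cap - total) * sizeof(*buf))) > 0)
        total += n / (int)sizeof(*buf);
    return total;
}
```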

  39. In-GPU-memory MapReduce: scalability.

                   1 GPU (no network)   4 GPUs (GPUnet)
      K-means      5.6 sec              1.6 sec (3.5x)
      Word-count   29.6 sec             10 sec (2.9x)

      GPUnet enables scale-out for GPU-accelerated systems.

  40. Face verification server.
      [Diagram: an unmodified CPU client talks via rsocket over Infiniband to a GPU server built on GPUnet, which in turn queries an unmodified memcached via rsocket. The server loop: recv(), GPU_features(), query_DB(), GPU_compare(), send(); the CPU baseline runs features() and compare() instead.]
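
The server column of the diagram, as a single GPU-side loop; GPU_features()/GPU_compare() stand for the application kernels, the memcached exchange is reduced to a bare send/recv (real protocol framing elided), and the g* calls remain the assumed socket API:

```cuda
__global__ void face_server(int client, int memcached) {
    __shared__ char img[16384];              // incoming image
    __shared__ char feat[128], ref[128];     // extracted vs. stored features
    for (;;) {
        if (grecv(client, img, sizeof(img)) <= 0) break;  // recv()
        GPU_features(img, feat);                          // GPU_features()
        gsend(memcached, feat, sizeof(feat));             // query_DB(): lookup key out...
        grecv(memcached, ref, sizeof(ref));               // ...stored features back
        char verdict = GPU_compare(feat, ref);            // GPU_compare()
        gsend(client, &verdict, 1);                       // send()
    }
}
```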

  41. Face verification: different implementations.
      [Plot: latency (μsec; median, 25th-75th percentile, 99th percentile) vs. throughput at 23, 34, and 54 KReq/sec, for three servers: CPU (6 cores), 1 GPU without GPUnet, and 1 GPU with GPUnet]

  42. Face verification, the takeaway: the GPUnet server delivers 1.9x the throughput at one third the latency, in half the lines of code.
      [Plot: the same latency vs. throughput plot as the previous slide, annotated with these ratios]
