GPUnet: networking abstractions for GPU programs. Mark Silberstein (Technion – Israel Institute of Technology); Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Emmett Witchel (University of Texas at Austin); Amir Wated (Technion)
What: a socket API for programs running on the GPU. Why: GPU-accelerated servers are hard to build. Results, GPU vs. CPU: 50% throughput, 60% latency, ½ LOC.
Motivation: GPU-accelerated networking applications. [Figure: data processing servers and MapReduce clusters, each built from multiple GPUs]
Recent GPU-accelerated networking applications: SSLShader (Jang 2011), GPU MapReduce (Stuart 2011), Deep Neural Networks (Coates 2013), Dandelion (Rossbach 2013), Rhythm (Agrawal 2014), ... all of which required heroic efforts.
GPU-accelerated networking apps share recurring themes: NIC-GPU interaction, CPU-GPU-NIC pipelining and buffer management, and request batching. We will sidestep these problems.
The real problem: the CPU is the only boss. [Figure: the CPU sits between the NIC, storage, and the GPU, mediating all interactions]
Example: a CPU server. [Figure: the CPU runs recv(), compute(), send() in a loop, with the NIC placing data in host memory]
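A minimal sketch of this loop, assuming a single TCP connection; compute() and the port number are illustrative placeholders, not taken from the talk:

    /* Minimal sketch of the CPU server loop on the slide.
     * compute() and PORT are illustrative placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PORT 2340

    static void compute(char *buf, ssize_t n) { /* application logic */ }

    int main(void) {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(PORT),
                                    .sin_addr.s_addr = INADDR_ANY };
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 16);
        int c = accept(srv, NULL, NULL);

        char buf[4096];
        ssize_t n;
        while ((n = recv(c, buf, sizeof(buf), 0)) > 0) {   /* recv()    */
            compute(buf, n);                               /* compute() */
            send(c, buf, n, 0);                            /* send()    */
        }
        close(c);
        close(srv);
        return 0;
    }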
Inside a GPU-accelerated server. In theory: recv(), GPU_compute(), send(). [Figure: CPU and GPU, each with its own memory, connected to the NIC over the PCIe bus]
In practice, the CPU must recv() and batch() incoming requests, optimize() the batch, transfer() it to GPU memory, balance() the load and invoke the GPU_compute() kernels, transfer() the results back, cleanup(), and finally dispatch() and send() the responses.
Aggressive pipelining: all of these stages are overlapped using double buffering, asynchrony, and multithreading.
And all of this code is just for a CPU to manage a GPU.
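To make the pattern concrete, here is a hedged sketch of the CPU-side orchestration for a single batch, before any pipelining is added; batch_requests(), balance(), dispatch_responses(), and process_kernel are illustrative placeholders, not code from the paper:

    // CPU-side orchestration of a GPU-accelerated server, one batch at a
    // time (no double buffering yet). The helpers are placeholders.
    #include <cuda_runtime.h>
    #include <stddef.h>

    int  batch_requests(int sock, char *buf, size_t cap);     // recv() + batch()
    int  balance(int nreq);                                    // pick a grid size
    void dispatch_responses(int sock, char *buf, int nreq);    // dispatch() + send()

    __global__ void process_kernel(const char *in, char *out, int nreq);

    void serve_one_batch(int sock, char *host_in, char *host_out,
                         char *dev_in, char *dev_out, size_t batch_bytes) {
        int nreq = batch_requests(sock, host_in, batch_bytes);
        cudaMemcpy(dev_in, host_in, batch_bytes,
                   cudaMemcpyHostToDevice);                      // transfer() to the GPU
        int blocks = balance(nreq);
        process_kernel<<<blocks, 256>>>(dev_in, dev_out, nreq);  // GPU_compute()
        cudaMemcpy(host_out, dev_out, batch_bytes,
                   cudaMemcpyDeviceToHost);                      // transfer() back
        dispatch_responses(sock, host_out, nreq);
    }

A production version would further overlap these steps with CUDA streams and multiple host threads, which is exactly the complexity GPUnet aims to remove.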
GPUs are not co-processors; they are peer processors, and they need I/O abstractions: file system I/O [GPUfs, ASPLOS 2013] and network I/O (this work).
GPUnet: a socket API for GPUs. Application view: on node0.technion.ac.il, a GPU-native server calls socket(AF_INET, SOCK_STREAM) and listen(:2340) through GPUnet; GPU-native clients and CPU clients call socket(AF_INET, SOCK_STREAM) and connect("node0:2340") over the network.
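A sketch of what the GPU-native server side might look like. The g-prefixed calls (gsocket, gbind_in, gaccept, grecv, gsend, gclose) are assumed GPU-callable analogues of the socket calls on the slide, and the buffer size is arbitrary; this is not GPUnet's documented API:

    // Hypothetical GPU-native echo-style server using GPUnet-like calls.
    // All g-prefixed names are assumptions modelled on the slide.
    __global__ void gpu_server(void) {
        int srv = gsocket(AF_INET, SOCK_STREAM);   // socket(AF_INET, SOCK_STREAM)
        gbind_in(srv, 2340);                       // bind/listen on :2340
        int c = gaccept(srv);

        __shared__ char buf[4096];
        int n;
        while ((n = grecv(c, buf, sizeof(buf))) > 0) {
            gpu_compute(buf, n);                   // application logic on the GPU
            gsend(c, buf, n);
        }
        gclose(c);
        gclose(srv);
    }

Such calls would be issued cooperatively by the threads of a threadblock (the coalesced API calls mentioned later), so many threadblocks can each serve their own connection without any request batching.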
GPU-accelerated server with GPUnet: the CPU is not involved. The GPU itself runs recv(), GPU_compute(), send(), and the NIC moves data directly to and from GPU memory over the PCIe bus.
No request batching: each GPU handler simply runs its own recv(), GPU_compute(), send() sequence.
Request pipelining happens automatically across the concurrent handlers, and buffer management is automatic as well.
Building a socket abstraction for GPUs
Goals: simplicity (a reliable streaming abstraction for GPUs) and performance (NIC → GPU data path optimizations).
Design option 1: transport layer processing on the CPU. The GPU controls the flow of data, but the transport processing and network buffers live in CPU memory, so every message pays for extra CPU-GPU memory transfers.
Design option 2: transport layer processing on the GPU, with network buffers in GPU memory filled by P2P DMA from the NIC. But do we really want TCP/IP running on a GPU? And should CPU applications have to access the network through the GPU?
Not the CPU, not the GPU: we need help from the NIC hardware.
RDMA: offloading transport layer processing to the NIC. Streaming message buffers sit in CPU and GPU memory, and the RDMA NIC provides reliable delivery between them.
GPUnet layers: the GPU socket API provides reliable in-order streaming for simplicity; underneath, a reliable channel runs either over RDMA transports (Infiniband, processed on the NIC for performance) or over non-RDMA transports (UNIX domain sockets, TCP/IP, processed on the CPU).
See the paper for ● Coalesced API calls ● Latency-optimized GPU-CPU flow control ● Memory management ● Bounce buffers ● Non-RDMA support ● GPU performance optimizations
Implementation ● Standard API calls, blocking and non-blocking ● libGPUnet.a: AF_INET, streaming over Infiniband RDMA ● Fully compatible with the CPU rsocket library ● libUNIXnet.a: AF_LOCAL, Unix domain socket support for inter-GPU and CPU-GPU communication
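Because the GPU server is wire-compatible with rsockets, an unmodified CPU client can talk to it with librdmacm's rsocket calls. A minimal sketch, where the host name, port, and request payload are placeholders (link with -lrdmacm):

    /* Minimal CPU client for a GPUnet server using librdmacm rsockets.
     * "node0" and port 2340 follow the earlier application-view slide;
     * the request payload is a placeholder. */
    #include <stdio.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <rdma/rsocket.h>

    int main(void) {
        struct addrinfo *ai;
        if (getaddrinfo("node0", "2340", NULL, &ai))
            return 1;

        int fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
        rconnect(fd, ai->ai_addr, ai->ai_addrlen);

        const char req[] = "request";                   /* placeholder payload */
        char resp[4096];
        rsend(fd, req, sizeof(req), 0);                 /* handled on the GPU  */
        ssize_t n = rrecv(fd, resp, sizeof(resp), 0);
        printf("got %zd bytes back\n", n);

        rclose(fd);
        freeaddrinfo(ai);
        return 0;
    }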
Implementation. [Figure: the GPU application sits on top of the GPUnet socket library, with network buffers and flow-control state in GPU memory; a GPUnet proxy and bounce buffers in CPU memory provide a fallback path between the NIC and the GPU]
Evaluation ● Analysis of GPU-native server design ● Matrix product server ● In-GPU-memory MapReduce ● Face verification server. Hardware: 2 × 6-core Intel E5-2620, NVIDIA Tesla K20Xm GPU, Mellanox Connect-IB HCA, Switch-X bridge.
In-GPU-memory MapReduce. [Figure: on each GPU, Map reads input via GPUfs; intermediate data is shuffled to Receivers on other GPUs over GPUnet, then sorted and reduced]
In-GPU-memory MapReduce: scalability. 1 GPU (no network) vs. 4 GPUs (GPUnet): K-means 5.6 sec vs. 1.6 sec (3.5×); word count 29.6 sec vs. 10 sec (2.9×). GPUnet enables scale-out for GPU-accelerated systems.
Face verification server. [Figure: an unmodified CPU client talks to the GPU server (GPUnet) over Infiniband via rsocket; the GPU server queries an unmodified memcached, also via rsocket. The CPU version runs recv(), features(), query_DB(), compare(), send(); the GPU version runs recv(), GPU_features(), query_DB(), GPU_compare(), send().]
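A hedged sketch of the GPU-side handler, reusing the assumed g-prefixed socket calls from the earlier sketch; GPU_features, query_DB, and GPU_compare are stand-ins for the stages named on the slide, and the buffer sizes are arbitrary:

    // Hypothetical GPU-side face verification handler. grecv/gsend are the
    // assumed GPU-callable socket operations from the earlier sketch.
    #define REQ_SIZE  4096   // arbitrary sizes for illustration
    #define FEAT_SIZE 1024

    __device__ void GPU_features(const char *img, char *feat);     // extract features
    __device__ void query_DB(int memcached_sock, const char *req,
                             char *stored);                        // gsend/grecv to memcached
    __device__ int  GPU_compare(const char *a, const char *b);     // the "? =" comparison

    __global__ void face_verify_handler(int client_sock, int memcached_sock) {
        __shared__ char req[REQ_SIZE], feat[FEAT_SIZE], stored[FEAT_SIZE];
        while (grecv(client_sock, req, sizeof(req)) > 0) {  // request from CPU client
            GPU_features(req, feat);
            query_DB(memcached_sock, req, stored);          // stored features from memcached
            int match = GPU_compare(feat, stored);
            gsend(client_sock, &match, sizeof(match));      // verdict back to the client
        }
    }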
Face verification: different implementations. [Chart: request latency (μsec; median, 25th-75th percentile, 99th percentile) versus throughput (23, 34, 54 KReq/sec) for three configurations: 1 GPU without GPUnet, a 6-core CPU, and 1 GPU with GPUnet.] Summary: 1.9× throughput, 1/3 the latency, ½ the lines of code.