
High performance networking from GPU kernels (Feras Daoud, Technion)



  1. GPUrdma: GPU-side library for high performance networking from GPU kernels. Feras Daoud, Mark Silberstein, Amir Watad (Technion – Israel Institute of Technology).

  2. Agenda: 1. Introduction  2. InfiniBand Background  3. GPUrdma  4. GPUrdma Evaluation  5. GPI2

  3. What: a GPU-side library for performing RDMA directly from GPU kernels. Why: to improve communication performance between distributed GPUs. Results: 5 µsec GPU-to-GPU communication latency and up to 50 Gbps transfer bandwidth.

  4. Evolution of GPU-HCA interaction: the naive version. [Diagram: data is staged through CPU RAM on its way between the GPU and the HCA, and the CPU drives the control path.]

  5. Evolution of GPU-HCA interaction: GPUDirect RDMA. [Diagram: the HCA now reads and writes GPU memory directly (direct data path), but the control path still runs through the CPU.]

  6. Evolution of GPU-HCA interaction: GPUrdma. [Diagram: both the data path and the control path run directly between GPU memory and the HCA, with no CPU involvement.]

  7. Motivation: the GPUDirect RDMA pattern on a node is CPU_rdma_read(); GPU_kernel<<<>>> { GPU_Compute(); }; CPU_rdma_write().

  8. Motivation: in this pattern the CPU-side CPU_rdma_read()/CPU_rdma_write() calls around every kernel launch add CPU overhead.

  9. Motivation: the communication / computation / communication structure forces a bulk-synchronous design with explicit pipelining.

  10. Motivation: pipelining means multiple GPU kernel invocations, each wrapped in CPU_rdma_read() and CPU_rdma_write(), which brings 1) kernel call overhead and 2) inefficient shared memory usage.

  11. Motivation: sparse data. Example: Find_Even_Num() over the input 0 3 5 2 5 1 8 yields the mask 1 0 0 1 0 0 1, i.e. results only at offsets 0x0, 0x3, 0x6; with the CPU-driven pattern these sparse results still have to be shipped by CPU_rdma_write() only after the kernel finishes.

  12. The GPUrdma library moves the whole loop into one kernel: GPU_kernel<<<>>> { GPU_rdma_read(); GPU_Compute(); GPU_rdma_write(); }. Benefits: • no CPU intervention • overlapping communication and computation • one kernel call • efficient shared memory usage • sparse data is sent directly from the kernel. A sketch of this in-kernel pattern follows.
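A minimal sketch of the in-kernel pattern above, assuming hypothetical device-side helpers named after the slide (gpu_rdma_read, gpu_rdma_write); the real GPUrdma device API is not shown on the slide, so the stubs below only mark where the library calls go.

    #include <stddef.h>

    __device__ void gpu_rdma_read(void *dst, size_t len)        { /* GPUrdma device-side receive (stub) */ }
    __device__ void gpu_rdma_write(const void *src, size_t len) { /* GPUrdma device-side RDMA write (stub) */ }
    __device__ void compute(char *buf, size_t len)              { /* application compute step (stub) */ }

    __global__ void gpu_rdma_kernel(char *buf, size_t len, int rounds)
    {
        /* One kernel launch performs many receive/compute/send rounds,
           overlapping communication with computation and keeping shared
           memory and register state alive across rounds. */
        for (int i = 0; i < rounds; ++i) {
            gpu_rdma_read(buf, len);     /* pull the next input into GPU memory */
            compute(buf, len);           /* process it in place */
            gpu_rdma_write(buf, len);    /* push the result to the remote node */
        }
    }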

  13. Agenda: 1. Introduction  2. InfiniBand Background  3. GPUrdma  4. GPUrdma Evaluation  5. GPI2

  14. InfiniBand background:
     1. Queue pair (QP) buffer: a send queue and a receive queue of tasks.
     2. Work queue element (WQE): holds the communication instructions for one job.
     3. Completion queue (CQ) buffer: holds completion elements.
     4. Completion queue element (CQE): holds information about a completed job.
     [Diagram: WQEs flow from the QP buffer to the HCA through the verbs interface; CQEs come back into the CQ buffer.]

  15. InfiniBand background: the control path. The doorbell is an MMIO address; ringing it informs the HCA about new jobs. The steps are: 1) write a work queue element into the QP buffer; 2) ring the doorbell; 3) check the completion queue element status. [Diagram: QP, CQ, and data buffers sit in CPU memory; the CPU drives the control path and the HCA moves the data.] A host-side sketch of these three steps with standard libibverbs calls is shown below.
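A host-side sketch of the three control-path steps using the standard libibverbs API: ibv_post_send builds the WQE and rings the doorbell, ibv_poll_cq checks for the CQE. The qp, cq, mr, remote address, and rkey are assumed to have been set up and exchanged beforehand.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_once(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge;
        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t)local_buf;
        sge.length = len;
        sge.lkey   = mr->lkey;

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a CQE */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* Steps 1 and 2: write the WQE into the QP buffer and ring the doorbell. */
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* Step 3: poll the CQ until the CQE for this job shows up. */
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }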

  16. Agenda: 1. Introduction  2. InfiniBand Background  3. GPUrdma  4. GPUrdma Evaluation  5. GPI2

  17. GPUrdma node: • direct path for data exchange • direct HCA control from GPU kernels • no CPU intervention. The result is a native GPU node. [Diagram: QP, CQ, and data buffers reside in GPU memory; both the control path and the data path run between the GPU and the HCA.]

  18. GPUrdma implementation, starting point. [Diagram: QP, CQ, and data buffers all live in CPU memory; the CPU drives both the data path and the control path.]

  19. GPUrdma implementation: the data path uses GPUDirect RDMA. [Diagram: the data buffer moves into GPU memory and the HCA accesses it directly; QP and CQ stay in CPU memory.] A sketch of registering a GPU buffer for GPUDirect RDMA follows.
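A minimal sketch of the GPUDirect RDMA data path: allocate a buffer in GPU memory and register it with the HCA so RDMA reads and writes target device memory directly. This assumes a Mellanox HCA with the nv_peer_mem peer-memory module loaded so that ibv_reg_mr can pin device memory; pd is an existing protection domain.

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes,
                                       void **dev_buf_out)
    {
        void *dev_buf = NULL;
        if (cudaMalloc(&dev_buf, bytes) != cudaSuccess)
            return NULL;

        /* With peer-memory support, ibv_reg_mr accepts the device pointer and
           pins the GPU pages for the HCA. */
        struct ibv_mr *mr = ibv_reg_mr(pd, dev_buf, bytes,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) {
            cudaFree(dev_buf);
            return NULL;
        }
        *dev_buf_out = dev_buf;
        return mr;
    }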

  20. GPUrdma implementation, step 1: move the QP and CQ buffers into GPU memory. [Diagram: QP, CQ, and data now all reside in GPU memory.]

  21. GPUrdma implementation, step 1 (continued): moving the QP and CQ requires modifying the InfiniBand verbs library, specifically ibv_create_qp() and ibv_create_cq(), so that the queue buffers are allocated in GPU memory. The sketch below shows the unmodified creation calls, with comments marking where GPUrdma redirects the allocation.
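For reference, a sketch of standard QP/CQ creation with stock libibverbs. The provider library allocates the ring buffers inside these calls; the comments mark where GPUrdma's modified verbs would place them in GPU memory instead (for example, memory obtained with cudaMalloc and mapped for the HCA). The exact hook is internal to the provider driver and is not shown here.

    #include <infiniband/verbs.h>
    #include <string.h>

    struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                struct ibv_cq **cq_out)
    {
        /* CQ with room for 256 completions. In GPUrdma the ring buffer behind
           this CQ is placed in GPU memory by the modified ibv_create_cq(). */
        struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
        if (!cq)
            return NULL;

        struct ibv_qp_init_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.send_cq          = cq;
        attr.recv_cq          = cq;
        attr.qp_type          = IBV_QPT_RC;
        attr.cap.max_send_wr  = 256;
        attr.cap.max_recv_wr  = 256;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;

        /* In GPUrdma the modified ibv_create_qp() allocates the send/receive
           queue buffers in GPU memory so GPU threads can write WQEs directly;
           the stock call below allocates them in host memory. */
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (!qp) {
            ibv_destroy_cq(cq);
            return NULL;
        }
        *cq_out = cq;
        return qp;
    }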

  22. GPUrdma implementation, step 2: map the HCA doorbell address into the GPU address space.

  23. GPUrdma implementation, step 2 (continued): this mapping requires modifying the NVIDIA driver. A hedged sketch of one user-level way to expose an MMIO doorbell to device code follows.
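Illustrative only: the paper achieves the mapping by modifying the NVIDIA driver. The sketch below shows an alternative user-level approach, assuming the doorbell page was already mmap()ed from the HCA (for example by the verbs provider) and that the CUDA version in use supports the cudaHostRegisterIoMemory flag; db_mmio and db_page_size are hypothetical names for that mapping.

    #include <cuda_runtime.h>
    #include <stdint.h>

    /* Device-visible pointer to the doorbell register, set once from the host. */
    __device__ volatile uint64_t *g_doorbell;

    cudaError_t map_doorbell_for_gpu(void *db_mmio, size_t db_page_size)
    {
        /* Register the MMIO page so CUDA can hand out a device pointer to it. */
        cudaError_t err = cudaHostRegister(db_mmio, db_page_size,
                                           cudaHostRegisterIoMemory |
                                           cudaHostRegisterMapped);
        if (err != cudaSuccess)
            return err;

        void *db_dev = NULL;
        err = cudaHostGetDevicePointer(&db_dev, db_mmio, 0);
        if (err != cudaSuccess)
            return err;

        /* Publish the device pointer so GPU threads can ring the doorbell. */
        volatile uint64_t *p = (volatile uint64_t *)db_dev;
        return cudaMemcpyToSymbol(g_doorbell, &p, sizeof(p));
    }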

  24. Agenda: 1. Introduction  2. InfiniBand Background  3. GPUrdma  4. GPUrdma Evaluation  5. GPI2

  25. GPUrdma evaluation: • single QP • multiple QPs • scalability and the optimal QP/CQ location. Hardware: NVIDIA Tesla K40c GPU, Mellanox Connect-IB HCA.

  26. GPUrdma, 1 thread, 1 QP: best performance of the CPU controller vs. the GPU controller. [Plot comparing the two controllers.]

  27. GPUrdma, 1 thread, 1 QP: optimizing the doorbell rings improves the GPU controller by about 3x. [Plot: CPU controller vs. doorbell-optimized GPU controller.]

  28. GPUrdma, 1 thread, 1 QP: optimizing the CQ poll improves the GPU controller further. [Plot: CPU controller vs. doorbell-optimized vs. CQ-optimized GPU controller.]

  29. GPUrdma, 32 threads, 1 QP: the GPU controller writes jobs in parallel. [Plot: CPU controller vs. CQ-optimized vs. parallel-writes GPU controller.]

  30. GPUDirect RDMA baseline: CPU controller with 1 QP vs. 30 QPs. [Plot.]

  31. GPUrdma with 30 QPs: 1 QP per block vs. 30 QPs per block. The 30-QPs-per-block configuration reaches 50 Gbps, about 3x the CPU controller with 30 QPs. [Plot.]

  32. Scalability, optimal QP/CQ location. Four placements are compared: 1) QP and CQ in GPU memory; 2) QP in GPU memory, CQ in system memory; 3) CQ in GPU memory, QP in system memory; 4) QP and CQ in system memory. [Diagram: placement 1, both QP and CQ in GPU memory.]

  33. Scalability, optimal QP/CQ location (continued). [Diagram: placement 2, QP in GPU memory and CQ in system memory.]

  34. Scalability, optimal QP/CQ location (continued). [Diagram: placement 3, CQ in GPU memory and QP in system memory.]

  35. Scalability, optimal QP/CQ location (continued). [Diagram: placement 4, QP and CQ in system memory.]

  36. Optimal QP/CQ location results:
     • Throughput: no difference between placements.
     • Transfer latency [µsec]:
                        QP in CPU   QP in GPU
        CQ in CPU          8.6         6.2
        CQ in GPU          6.8         4.8

  37. Limitations: with GPUDirect RDMA under CUDA v7.5, a running kernel may observe stale data or data that arrives out of order. Scenario: intensive RDMA writes to GPU memory. Good news: NVIDIA announced a CUDA 8 feature that enables consistent updates. Suggested fix: a CRC32 integrity-check API for error detection; a sketch of such a device-side check follows.
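A hedged sketch of the suggested CRC32 integrity check: the sender appends a CRC32 of the payload, and the receiving kernel re-reads the buffer until the checksum matches, so stale or out-of-order data is detected. Function and field names here are illustrative, not the GPUrdma API.

    #include <stdint.h>
    #include <stddef.h>

    __device__ uint32_t crc32_device(const volatile uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;                  /* standard reflected CRC-32 */
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
        return ~crc;
    }

    /* Spin until the payload's CRC matches the checksum the sender wrote after it. */
    __device__ void wait_for_consistent_message(const volatile uint8_t *payload,
                                                size_t len,
                                                const volatile uint32_t *sent_crc)
    {
        while (crc32_device(payload, len) != *sent_crc)
            ;  /* message not yet fully or consistently visible; poll again */
    }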

  38. Agenda: 1. Introduction  2. InfiniBand Background  3. GPUrdma  4. GPUrdma Evaluation  5. GPI2

  39. GPI2 for GPUs: GPI is a framework that implements a Partitioned Global Address Space (PGAS); GPI2 extends this global address space to GPU memory. [Diagram: each host and each GPU contributes a global memory segment while keeping its own local memory, with both host and GPU threads participating.]

  40. GPI2 code example.
     CPU node: gaspi_segment_create(CPU_MEM) → initialize data → gaspi_write_notify → gaspi_notify_waitsome → gaspi_proc_term.
     GPU node: gaspi_segment_create(GPU_MEM) → gaspi_notify_waitsome → GPU_Compute_data<<<>>> → gaspi_write_notify → gaspi_proc_term.
     A hedged sketch of the CPU-node flow with the GASPI API is shown below.
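A sketch of the CPU-node flow from the slide using the GASPI API as defined in the GASPI specification and GPI-2's GASPI.h; the segment id, sizes, offsets, notification ids, and the peer rank (REMOTE_RANK) are illustrative values, not from the paper, and return-code checking is omitted for brevity.

    #include <GASPI.h>

    #define SEG_ID       0
    #define SEG_SIZE     (1 << 20)
    #define REMOTE_RANK  1
    #define QUEUE        0

    int cpu_node(void)
    {
        gaspi_proc_init(GASPI_BLOCK);

        /* Create a 1 MiB segment in host memory and get a pointer to it. */
        gaspi_segment_create(SEG_ID, SEG_SIZE, GASPI_GROUP_ALL,
                             GASPI_BLOCK, GASPI_MEM_INITIALIZED);
        gaspi_pointer_t seg_ptr;
        gaspi_segment_ptr(SEG_ID, &seg_ptr);
        /* ... initialize data in seg_ptr ... */

        /* Write the data to the remote segment and raise notification 0 there. */
        gaspi_write_notify(SEG_ID, 0, REMOTE_RANK, SEG_ID, 0, SEG_SIZE,
                           0, 1, QUEUE, GASPI_BLOCK);
        gaspi_wait(QUEUE, GASPI_BLOCK);

        /* Wait until the remote side writes its result back and notifies us. */
        gaspi_notification_id_t id;
        gaspi_notify_waitsome(SEG_ID, 0, 1, &id, GASPI_BLOCK);

        gaspi_proc_term(GASPI_BLOCK);
        return 0;
    }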

  41. GPI2 using GPUrdma.
     CPU node: unchanged (gaspi_segment_create(CPU_MEM) → initialize data → gaspi_write_notify → gaspi_notify_waitsome → gaspi_proc_term).
     GPU node: gaspi_segment_create(GPU_MEM) → GPU_start_kernel<<<>>> { gpu_gaspi_notify_waitsome → Compute_data() → gpu_gaspi_write_notify } → gaspi_proc_term. The wait, compute, and write steps now run inside a single GPU kernel, as sketched below.
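A sketch of the GPU-node side: one kernel launch that waits for a notification, computes, and writes the result back, using device-side stubs named after the slide (gpu_gaspi_notify_waitsome, gpu_gaspi_write_notify). Their real signatures are not given on the slide, so the prototypes below are hypothetical placeholders.

    #include <stddef.h>

    __device__ void gpu_gaspi_notify_waitsome(int notification_id) { /* wait for a remote write (stub) */ }
    __device__ void gpu_gaspi_write_notify(int notification_id)    { /* RDMA-write the result and notify (stub) */ }
    __device__ void compute_data(char *seg, size_t len)            { /* application compute step (stub) */ }

    __global__ void gpu_start_kernel(char *gpu_segment, size_t len, int batches)
    {
        for (int b = 0; b < batches; ++b) {
            gpu_gaspi_notify_waitsome(0);     /* wait for the CPU node's write */
            compute_data(gpu_segment, len);   /* process the batch in GPU memory */
            gpu_gaspi_write_notify(0);        /* send the result and notify the peer */
        }
    }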

  42. GPUrdma multi-matrix vector product. [Benchmark flow: the CPU node starts the timer and issues gaspi_write_notify; the GPU node runs gpu_notify_waitsome → Matrix_compute() → gpu_write_notify; the CPU node's gaspi_notify_waitsome returns and the timer stops.] System throughput in millions of 32x1 vector multiplications per second as a function of the batch size:
        Batch size [vectors]   GPI2    GPUrdma
        480                     2.6     11.7
        960                     4.8     18.8
        1920                    8.4     25.2
        3840                   13.9     29.1
        7680                   19.9     30.3
        15360                  24.3     31.5

  43. Related work.
     Lena Oden, Fraunhofer Institute for Industrial Mathematics:
     • Infiniband-Verbs on GPU: A case study of controlling an InfiniBand network device from the GPU
     • Analyzing Put/Get APIs for Thread-collaborative Processors
     Mark Silberstein, Technion – Israel Institute of Technology:
     • GPUnet: networking abstractions for GPU programs
     • GPUfs: Integrating a file system with GPUs
     Thanks!
