Design Guidelines for High Performance RDMA Systems - Anuj Kalia (presentation slides)

  1. Design Guidelines for High Performance RDMA Systems. Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU).

  2. RDMA is cheap (and fast!) Mellanox Connect-IB: 2x 56 Gbps InfiniBand, ~2 µs RTT, RDMA, $1300. Problem: performance depends on complex low-level factors.

  3. Background: RDMA read. [Diagram: the server's NIC receives the RDMA read request, issues a DMA read over PCI Express to the CPU's L3/memory, and returns the RDMA read response; the server's cores are not involved.]
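
A minimal sketch of how a client would post such a one-sided read with libibverbs (not from the talk; QP setup and memory registration are assumed to have happened already):

```c
/* Minimal sketch (not from the talk) of posting the one-sided RDMA read
 * shown above, using libibverbs. Assumes a connected RC QP and a
 * registered local buffer; remote_addr/rkey identify server memory. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                   uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_READ;   /* served by the remote NIC; no remote CPU */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);
}
```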

  4. How to design a sequencer? [Diagram: a server hands out consecutive values (87, 88, ...) to multiple clients.]

  5. Which RDMA ops to use? Remote CPU bypass (one-sided): Read, Write, Fetch-and-add (2.2 M/s), Compare-and-swap. Remote CPU involved (messaging, two-sided): Send, Recv.

  6. How we sped up the sequencer by 50x

  7. Large RDMA design space. Operations: READ, WRITE, ATOMIC (remote bypass, one-sided); SEND, RECV (two-sided). Transports: Reliable or Unreliable, Connected or Datagram. Optimizations: Inlined, Unsignaled, Doorbell batching, WQE shrinking, 0B-RECVs.

  8. Guidelines. NICs have multiple processing units (PUs): avoid contention; exploit parallelism. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).

  9. High contention w/ atomics. [Diagram: the NIC's PUs serialize Fetch&Add(A, 1) operations on the sequence counter A in the server's L3/memory, each needing a DMA read and a DMA write over PCI Express.] Latency ~500 ns, throughput ~2 M/s.
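
For concreteness, a hedged sketch of the atomics-based sequencer client this slide describes, using libibverbs; next_seq, the connected QP, and the registered local buffer are illustrative assumptions, not the paper's code:

```c
/* Illustrative sketch: sequencer client using a one-sided RDMA
 * fetch-and-add (libibverbs). Assumes an already-connected RC QP,
 * a registered 8B local buffer, and the server counter's remote
 * address/rkey exchanged out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

uint64_t next_seq(struct ibv_qp *qp, struct ibv_cq *cq,
                  uint64_t *local_buf, uint32_t local_lkey,
                  uint64_t counter_raddr, uint32_t counter_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* old counter value lands here */
        .length = sizeof(uint64_t),
        .lkey   = local_lkey,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = counter_raddr;   /* 8B-aligned counter on server */
    wr.wr.atomic.rkey        = counter_rkey;
    wr.wr.atomic.compare_add = 1;               /* increment by 1 */
    ibv_post_send(qp, &wr, &bad_wr);

    /* Wait for the completion; the NIC has DMA-written the
     * pre-increment counter value into local_buf. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return *local_buf;
}
```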

  10. Reduce contention: use CPU cores [HERD, SIGCOMM 14]. [Diagram: clients send RPC requests as RDMA writes; a server core updates the counter A in L3 (core-to-L3 ~20 ns, vs. ~500 ns across PCI Express) and returns the value with a SEND (RPC resp).]
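
Below is an illustrative, HERD-style sketch of the server side of this design: the counter stays in a core's cache and responses go out as SENDs. All names (serve, req_slot, resp_qp) are hypothetical, and completion handling is omitted for brevity:

```c
/* HERD-style sketch (illustrative, not the paper's code): clients
 * deposit requests with RDMA WRITEs into per-client slots; this core
 * increments a counter held in its cache (~20 ns) instead of paying a
 * ~500 ns PCIe atomic, and returns the value with a SEND.
 * Selective signaling and CQ polling are omitted for brevity. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

struct req_slot { volatile uint8_t valid; };    /* set by a client's RDMA WRITE */

void serve(struct req_slot *slots, int num_clients,
           struct ibv_qp **resp_qp, uint64_t *resp_buf, uint32_t resp_lkey)
{
    uint64_t counter = 0;                       /* lives in the core's cache */
    for (;;) {
        for (int c = 0; c < num_clients; c++) {
            if (!slots[c].valid)
                continue;
            slots[c].valid = 0;
            resp_buf[c] = counter++;            /* no PCIe round trip */

            struct ibv_sge sge = {
                .addr   = (uintptr_t)&resp_buf[c],
                .length = sizeof(uint64_t),
                .lkey   = resp_lkey,
            };
            struct ibv_send_wr wr, *bad;
            memset(&wr, 0, sizeof(wr));
            wr.opcode     = IBV_WR_SEND;
            wr.sg_list    = &sge;
            wr.num_sge    = 1;
            wr.send_flags = IBV_SEND_INLINE;    /* 8B payload fits in the WQE */
            ibv_post_send(resp_qp[c], &wr, &bad);
        }
    }
}
```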

  11. Sequencer throughput. [Chart, M/s: Atomics 2.2; RPC (1 core) 7.]

  12. Reduce MMIOs w/ Doorbell batching. [Diagram: Push: the CPU writes each SEND descriptor to the NIC via MMIO, costing lots of CPU cycles. Pull: the CPU rings one Doorbell and the NIC pulls the SEND descriptors via DMA.]

  13. RPCs w/ Doorbell batching. [Diagram: CPU-NIC timelines for requests and responses, Push vs. Pull (Doorbell batching).]
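
A sketch of what Doorbell batching looks like in libibverbs: chained work requests handed to the NIC with a single post, i.e. one MMIO instead of one per response. The helper name and the pre-filled sges[] are assumptions:

```c
/* Sketch of Doorbell batching with libibverbs: chain the batch of
 * response work requests via the next pointer and hand them to the
 * NIC with a single ibv_post_send(), i.e. one MMIO (Doorbell) instead
 * of one per response. */
#include <infiniband/verbs.h>
#include <string.h>

int post_batch(struct ibv_qp *qp, struct ibv_send_wr *wrs,
               struct ibv_sge *sges, int n)
{
    for (int i = 0; i < n; i++) {
        memset(&wrs[i], 0, sizeof(wrs[i]));
        wrs[i].opcode  = IBV_WR_SEND;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
        wrs[i].next    = (i + 1 < n) ? &wrs[i + 1] : NULL;   /* chain the WRs */
        /* Signal only the last WR to also cut completion DMAs. */
        wrs[i].send_flags = (i + 1 == n) ? IBV_SEND_SIGNALED : 0;
    }
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wrs[0], &bad);    /* one Doorbell for the batch */
}
```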

  14. Sequencer throughput. [Chart, M/s: Atomics 2.2; RPC (1 core) 7; +Doorbell batching 16.6.]

  15. Exploit NIC parallelism w/ multiple queues. [Diagram: with a single queue, one NIC PU handles every SEND (RPC resp) and becomes the bottleneck while other PUs sit idle.]
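
One plausible way to use multiple queues, sketched with libibverbs: round-robin responses over a few QPs so several NIC PUs stay busy. NUM_QPS and the rotation policy are illustrative choices, not the paper's exact setup:

```c
/* Illustrative multi-queue sketch: spread responses over several QPs
 * so more than one NIC processing unit works in parallel. */
#include <infiniband/verbs.h>

#define NUM_QPS 3

struct multiq {
    struct ibv_qp *qp[NUM_QPS];
    int next;
};

static inline int post_rr(struct multiq *m, struct ibv_send_wr *wr)
{
    struct ibv_send_wr *bad;
    int i = m->next;
    m->next = (m->next + 1) % NUM_QPS;          /* rotate across queues/PUs */
    return ibv_post_send(m->qp[i], wr, &bad);
}
```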

  16. Sequencer throughput. [Chart, M/s: Atomics 2.2; RPC (1 core) 7; +Doorbell batching 16.6; +3 queues 27.4.]

  17. Sequencer throughput. [Chart, M/s: Atomics 2.2; RPC (1 core) 7; +Batching 16.6; +3 queues 27.4; +6 cores 97.2.] Bottleneck = PCIe DMA bandwidth (details in the paper).

  18. Reduce DMA size: header-only. [Diagram: a regular SEND response occupies 128B (64B header with a 4B immediate field, 8B payload, 52B unused); moving the payload into the header shrinks the transfer to the 64B header alone.]
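
A hedged sketch of a header-only response and the matching 0B-RECV in libibverbs, assuming for illustration that the value fits in the 4-byte immediate field:

```c
/* Hedged sketch of a header-only response: a 0-byte SEND whose payload
 * travels in the 4-byte immediate field, paired with 0B-RECVs on the
 * receiver. */
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

int send_header_only(struct ibv_qp *qp, uint32_t value)
{
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND_WITH_IMM;
    wr.imm_data   = htonl(value);      /* payload rides in the header */
    wr.num_sge    = 0;                 /* 0-byte message: no payload DMA */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad);
}

int post_zero_byte_recv(struct ibv_qp *qp, uint64_t wr_id)
{
    struct ibv_recv_wr rwr, *bad;
    memset(&rwr, 0, sizeof(rwr));
    rwr.wr_id   = wr_id;
    rwr.num_sge = 0;                   /* 0B-RECV: no buffer needed */
    return ibv_post_recv(qp, &rwr, &bad);
}
```

On the receive side, the value is read from wc.imm_data when the completion carries the IBV_WC_WITH_IMM flag.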

  19. Sequencer throughput. [Chart, M/s: Atomics 2.2; RPC (1 core) 7; +Doorbell batching, multiple queues 27.4; +6 cores 97.2; +Header-only 122 (the 50x overall speedup).]

  20. Evaluation: the optimizations on 3 RDMA generations; PCIe models and bottlenecks; more atomics experiments (example: atomic operations on multiple addresses).

  21. RPC-based key-value store: HERD [SIGCOMM 14], 16B keys, 32B values, 5% PUTs. [Chart: throughput (M/s) vs. number of cores (0-14), Baseline vs. +Doorbell batching at 9-14 responses per Doorbell.]

  22. Conclusion. NICs have multiple processing units (PUs): avoid contention; exploit parallelism. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs). Code: https://github.com/anujkaliaiitd/rdma_bench
