Design Guidelines for High Performance RDMA Systems. Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA is cheap (and fast!). Mellanox Connect-IB: • 2x 56 Gbps InfiniBand • ~2 µs RTT • RDMA • $1300. Problem: performance depends on complex low-level factors.
Background: RDMA read. [Diagram: an RDMA read request arrives at the NIC, the NIC issues a DMA read over PCI Express to the CPU's L3, and the NIC returns an RDMA read response.]
How to design a sequencer? [Diagram: a server hands out consecutive sequence numbers (87, 88) to two clients.]
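As a reference point for what the sequencer has to do, here is a minimal single-process sketch in Python. The class name `Sequencer` and the method `fetch_and_add` are hypothetical, the starting value 87 only mirrors the diagram, and no RDMA is involved.

```python
class Sequencer:
    """Hands out consecutive integers; every caller gets a unique value."""

    def __init__(self, start=0):
        self.value = start

    def fetch_and_add(self, delta=1):
        # Return the old counter value, then advance the counter.
        old = self.value
        self.value += delta
        return old


seq = Sequencer(start=87)
client_a = seq.fetch_and_add()  # first client's sequence number
client_b = seq.fetch_and_add()  # second client's sequence number
print(client_a, client_b)
```

The design question on the slides is where this counter update runs (on the NIC via atomics, or on a server core via RPCs), not what it computes.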
Which RDMA ops to use? Remote CPU bypass (one-sided): • Read • Write • Fetch-and-add (Perf? 2.2 M/s) • Compare-and-swap. Remote CPU involved (messaging, two-sided): • Send • Recv.
How we sped up the sequencer by 50X
Large RDMA design space. Operations: READ, WRITE, ATOMIC (remote bypass, one-sided); SEND, RECV (two-sided). Transports: Reliable or Unreliable; Connected or Datagram. Optimizations: Inlined, Unsignaled, Doorbell batching, WQE shrinking, 0B-RECVs.
Guidelines. NICs have multiple processing units (PUs): avoid contention; exploit parallelism. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).
High contention w/ atomics. [Diagram: for each Fetch&Add(A, 1) on sequence counter A, a NIC processing unit (PU) performs a DMA read and a DMA write over PCI Express to the CPU's L3.] Latency ~500 ns; throughput ~2 M/s.
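The two figures on this slide are consistent with a simple back-of-the-envelope model, sketched below. It assumes that atomics on the same address are fully serialized, so each one must finish its ~500 ns PCI Express round trip before the next begins; the function name is hypothetical.

```python
def serialized_ops_per_sec(latency_ns):
    """Upper bound on throughput when operations run strictly one at a time."""
    return 1e9 / latency_ns


# ~500 ns per Fetch&Add, taken from the slide.
atomics_mops = serialized_ops_per_sec(500) / 1e6
print(f"{atomics_mops:.1f} M/s")
```

Under that assumption the bound is about 2 M/s, which matches the measured atomics throughput quoted on the slide.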
Reduce contention: use CPU cores. [Diagram: an RDMA write (RPC req) reaches the NIC and is DMA-written over PCI Express (500 ns) into the CPU's L3; a core updates counter A (core to L3: 20 ns) and replies with a SEND (RPC resp).] [HERD, SIGCOMM 14]
Sequencer throughput [bar chart; y-axis: Throughput (M/s), 0 to 150; annotation: 50x]: Atomics 2.2; RPC (1 core) 7.
Reduce MMIOs w/ Doorbell batching. Push: the CPU pushes each SEND to the NIC with MMIOs ⇒ lots of CPU cycles. Pull: the NIC pulls the SENDs from the CPU with a DMA.
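A rough way to see the saving is to count CPU-issued MMIOs in each mode. The sketch below is a simplified model, not a description of any particular NIC: it assumes one MMIO per SEND in push mode and one doorbell MMIO per batch in pull mode, and both function names are hypothetical.

```python
import math


def push_mmios(num_sends):
    # Assumption: every SEND descriptor is pushed with its own MMIO write.
    return num_sends


def pull_mmios(num_sends, batch_size):
    # Assumption: one doorbell MMIO per batch; the NIC then fetches the
    # batched descriptors itself using DMA reads.
    return math.ceil(num_sends / batch_size)


print(push_mmios(1000), pull_mmios(1000, batch_size=8))
```

In this model the MMIO count falls by roughly the batch size, at the cost of extra NIC-initiated DMA reads.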
RPCs w/ Doorbell batching. [Diagram: flow of requests and responses between CPU and NIC under Push vs. Pull (Doorbell batching).]
Sequencer throughput [bar chart; y-axis: Throughput (M/s), 0 to 150; annotation: 50x]: Atomics 2.2; RPC (1 C) 7; +Dbell batching 16.6.
Exploit NIC parallelism w/ multiQ. [Diagram: cores update counter A in L3 and send SEND (RPC resp) over PCI Express to the NIC, where one processing unit is labeled Bottleneck and the other Idle.]
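One simple way to spread responses over several queues is round-robin selection, sketched below. The helper `pick_queue` is hypothetical, the count of 3 queues is chosen only for illustration, and whether distinct queues are actually served by distinct NIC processing units depends on the hardware.

```python
from collections import Counter


def pick_queue(response_index, num_queues):
    # Round-robin: consecutive responses go to consecutive queues.
    return response_index % num_queues


# Distribute 12 responses over 3 queues and count the load on each.
load = Counter(pick_queue(i, 3) for i in range(12))
print(dict(load))
```

Each queue ends up with the same number of responses, so no single queue carries all of the traffic.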
Sequencer throughput [bar chart; y-axis: Throughput (M/s), 0 to 150; annotation: 50x]: Atomics 2.2; RPC (1 C) 7; +Dbell batching 16.6; +3 queues 27.4.
Sequencer throughput [bar chart; y-axis: Throughput (M/s), 0 to 150; annotation: 50x]: Atomics 2.2; RPC (1 C) 7; +Batching 16.6; +3 queues 27.4; +6 cores 97.2. Bottleneck = PCIe DMA bandwidth (paper).
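Reduce DMA size: Header-only. [Diagram: the SEND WQE passed between CPU and NIC spans bytes 0 to 128, with fields Header, Imm, Size, Data, Unused and sizes shown as 64B, 4B, 8B, 52B; after the "Move payload" step the WQE spans bytes 0 to 64 and holds only Header and Imm.]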
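The effect on DMA size can be checked with a little arithmetic. The sketch below takes the 128-byte and 64-byte WQE extents from the byte offsets on the slide and assumes, as an illustration only, that WQEs are fetched in 64-byte cache-line units; the function name is hypothetical.

```python
import math

CACHE_LINE_BYTES = 64  # assumption: DMA granularity equals one cache line


def cache_lines(wqe_bytes):
    """Number of cache lines needed to hold one WQE."""
    return math.ceil(wqe_bytes / CACHE_LINE_BYTES)


before = cache_lines(128)  # WQE with a separate payload
after = cache_lines(64)    # header-only WQE
print(before, after)
```

Under that assumption the header-only WQE halves the bytes fetched per response.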
Sequencer throughput [bar chart; y-axis: Throughput (M/s), 0 to 150; annotation: 50x]: Atomics 2.2; RPC (1 C) 7; +4 Queues, Dbell batching 27.4; +6 cores 97.2; +Header-only 122.
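To put the chart values side by side, the short script below computes each step's speedup over the atomics baseline. The throughput numbers are the ones shown on the sequencer charts; the step labels follow the earlier charts, and the variable names are hypothetical.

```python
# (label, throughput in M/s) as reported on the sequencer throughput slides.
steps = [
    ("Atomics", 2.2),
    ("RPC (1 core)", 7.0),
    ("+Doorbell batching", 16.6),
    ("+3 queues", 27.4),
    ("+6 cores", 97.2),
    ("+Header-only", 122.0),
]

baseline = steps[0][1]
for name, mops in steps:
    print(f"{name:20s} {mops:6.1f} M/s  {mops / baseline:5.1f}x")

overall = steps[-1][1] / baseline
```

The overall ratio comes out slightly above the 50x headline figure.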
Evaluation: • Evaluation of optimizations on 3 RDMA generations • PCIe models, bottlenecks • More atomics experiments (example: atomic operations on multiple addresses)
RPC-based key-value store [line chart; x-axis: Number of cores, 0 to 14; y-axis: Throughput (M/s), 0 to 100; series: Baseline, +Doorbell Batching; annotations: 9 resps/doorbell, 14 resps/doorbell]. HERD [SIGCOMM 14], 16B keys, 32B values, 5% PUTs.
Conclusion. NICs have multiple processing units (PUs): avoid contention; exploit parallelism. PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs). Code: https://github.com/anujkaliaiitd/rdma_bench