Using RDMA Efficiently for Key-Value Services
Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA
Remote Direct Memory Access: a network feature that allows direct access to the memory of a remote computer.
HERD
1. Improved understanding of RDMA through micro-benchmarking
2. High-performance key-value system:
• Throughput: 26 Mops (2X higher than others)
• Latency: 5 µs (2X lower than others)
RDMA intro
Providers: InfiniBand, RoCE, …
Features:
• Ultra-low latency: 1 µs RTT
• Zero copy + CPU bypass
[Figure: data path between host A's and host B's user buffers through DMA buffers and the NICs.]
RDMA in the datacenter
48-port 10 GbE switches:
Switch            RDMA   Cost
Mellanox SX1012   Yes    $5,900
Cisco 5548UP      No     $8,180
Juniper EX5440    No     $7,480
In-memory KV stores
Interface: GET, PUT (e.g., memcached)
Requirements:
• Low latency
• High request rate
[Figure: web servers query a tier of memcached servers in front of a database.]
RDMA basics
Verbs (executed by the RNIC, the RDMA-capable NIC):
• RDMA read: READ(local_buf, size, remote_addr)
• RDMA write: WRITE(local_buf, size, remote_addr)
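Below is a minimal sketch, in C with libibverbs, of how such a READ verb might be posted in practice; the queue pair, memory registration, and the lkey/rkey values are assumed to be set up elsewhere, and the helper's name and signature are ours, not the talk's.

    /* Sketch: post a one-sided RDMA READ with libibverbs (assumptions noted above). */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t size,
                              uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) local_buf,   /* where the fetched bytes land */
            .length = size,
            .lkey   = lkey,                    /* local memory region key */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read, no remote CPU */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion entry */
        wr.wr.rdma.remote_addr = remote_addr;        /* address in the server's DRAM */
        wr.wr.rdma.rkey        = rkey;               /* remote memory region key */

        return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
    }

A WRITE is posted the same way with opcode IBV_WR_RDMA_WRITE.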
Life of a WRITE
[Figure: timeline between the requester (CPU/RAM, RNIC) and the responder (RNIC, CPU/RAM)]
1: Request descriptor, PIO
2: Payload, DMA read
3: RDMA write request
4: Payload, DMA write
5: RDMA ACK
6: Completion, DMA write
Recent systems
• Pilaf [ATC 2013]
• FaRM-KV [NSDI 2014]: an example usage of FaRM
Approach: RDMA reads to access remote data structures
Reason: the allure of CPU bypass
The price of CPU bypass
Key-value stores have an inherent level of indirection: an index maps a key to an address, and the values are stored separately.
At least 2 RDMA reads are required:
• ≧ 1 to fetch the address from the index
• 1 to fetch the value
(Not true if the value is stored inside the index.)
[Figure: the server's DRAM holds the index and the values.]
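The sketch below illustrates the two dependent READs in C; the index-entry layout, the bucket() hash, and the blocking-READ helper are assumptions for illustration, not Pilaf's or FaRM-KV's actual code.

    /* Sketch: why a READ-based GET needs at least two round trips. */
    #include <stdint.h>

    struct index_entry {      /* fetched by READ #1 from the server's index */
        uint64_t value_addr;  /* remote address of the value */
        uint32_t value_len;
        uint32_t value_rkey;
    };

    /* Posts an RDMA READ and waits for its completion (assumed helper). */
    int post_rdma_read_blocking(void *local_buf, uint32_t size,
                                uint64_t remote_addr, uint32_t rkey);
    uint64_t bucket(const char *key);            /* assumed hash function */

    int read_based_get(const char *key, void *value_buf,
                       uint64_t index_base, uint32_t index_rkey)
    {
        struct index_entry entry;

        /* READ #1: fetch the index entry (the pointer) for this key. */
        post_rdma_read_blocking(&entry, sizeof(entry),
                                index_base + bucket(key) * sizeof(entry),
                                index_rkey);

        /* READ #2: follow the pointer to fetch the value itself. The second
         * READ cannot be issued until the first completes, so the client
         * pays two dependent network round trips. */
        return post_rdma_read_blocking(value_buf, entry.value_len,
                                       entry.value_addr, entry.value_rkey);
    }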
The price of CPU bypass
[Figure: the client issues READ #1 to fetch the pointer from the server's index, then READ #2 to fetch the value: two dependent round trips.]
Our approach
Goal                          Main ideas
#1: Use a single round trip   Request-reply with server CPU involvement + WRITEs faster than READs
#2: Increase throughput       Low-level verbs optimizations
#3: Improve scalability       Use datagram transport
#1: Use a single round trip
[Figure: the client WRITEs the request (WRITE #1) into the server's memory; the server CPU performs the DRAM accesses and WRITEs the reply (WRITE #2) back: one round trip.]
#1: Use a single round trip
Operation        Round trips   Operations at server's RNIC
READ-based GET   2+            2+ RDMA reads
HERD GET         1             2 RDMA writes
→ Lower latency, higher throughput (server side sketched below).
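A rough sketch of the server side of this request-reply scheme: clients WRITE requests into per-client slots in the server's registered memory and the server CPU polls those slots, so no RNIC operation is needed to receive a request. The slot layout, the trailing valid-byte convention, and handle_request() are our assumptions, not HERD's exact code.

    /* Sketch: server CPU polls per-client request slots in registered memory. */
    #include <stdint.h>

    #define NUM_CLIENTS 16
    #define SLOT_SIZE   1024

    struct req_slot {
        uint8_t payload[SLOT_SIZE - 1];
        volatile uint8_t valid;   /* the client WRITEs this last to publish a request */
    };

    void handle_request(int client, const uint8_t *payload);  /* assumed */

    void server_poll_loop(struct req_slot *slots)  /* RDMA-registered region */
    {
        for (;;) {
            for (int c = 0; c < NUM_CLIENTS; c++) {
                if (slots[c].valid) {               /* new request from client c */
                    handle_request(c, slots[c].payload);
                    slots[c].valid = 0;             /* free the slot for reuse */
                }
            }
        }
    }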
RDMA WRITEs faster than READs
[Graph: throughput (Mops) vs. payload size (4 to 256 bytes) for READ and WRITE, several clients against one server. Setup: Apt cluster, 192 nodes, 56 Gbps InfiniBand.]
RDMA WRITEs faster than READs
Reason: PCIe writes are faster than PCIe reads.
[Figure: at the server, an RDMA write request triggers a PCIe DMA write and returns an RDMA ACK; an RDMA read request triggers a PCIe DMA read and returns an RDMA read response.]
High-speed request-reply
[Graph: request-reply throughput (Mops), 32-byte payloads, comparing 2-WRITE request-reply against 1 READ and 2 READs. Setup: one-to-one client-server communication, clients C1…C8 and one server.]
#2: Increase throughput
Simple request-reply:
[Figure: the client WRITEs the request (WRITE #1), the server CPU processes it, and the server WRITEs the response (WRITE #2); both hosts' RNICs and CPU/RAM are involved.]
Optimize WRITEs
• +inlining: encapsulate the payload in the request descriptor (merges step 2 into step 1)
• +unreliable: use unreliable transport (removes step 5, the RDMA ACK)
• +unsignaled: don't ask for request completions (removes step 6)
(Step numbers refer to the "Life of a WRITE" slide; see the sketch below.)
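A minimal sketch of posting such an optimized request WRITE with libibverbs, assuming an unreliable connected (UC) queue pair created with IBV_QPT_UC and a max_inline_data large enough for the payload; the function name and parameters are placeholders, not HERD's code.

    /* Sketch: inlined, unsignaled WRITE over an unreliable connected QP. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_optimized_write(struct ibv_qp *uc_qp, void *req, uint32_t len,
                                    uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) req,
            .length = len,
            /* No lkey needed for inlined payloads: the CPU copies the payload
             * into the descriptor, so the RNIC never DMA-reads the buffer. */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        /* +inlining: the payload travels inside the request descriptor (PIO).
         * +unsignaled: IBV_SEND_SIGNALED is omitted, so no completion entry is
         *   DMA-ed back for this WRITE (a signaled WRITE must still be posted
         *   periodically to drain the send queue). */
        wr.send_flags          = IBV_SEND_INLINE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        /* +unreliable: uc_qp is a UC queue pair, so the responder sends no ACK. */
        return ibv_post_send(uc_qp, &wr, &bad_wr);
    }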
#2: Increase throughput
Optimized request-reply:
[Figure: the same request-reply flow, but with inlined, unreliable, unsignaled WRITEs, so fewer PCIe and InfiniBand transactions per request.]
#2: Increase throughput
[Graph: request-reply throughput (Mops) as optimizations are added (basic, +unreliable, +unsignaled, +inlined), compared with READ. Setup: one-to-one client-server communication, clients C1…C8 and one server.]
#3: Improve scalability
[Graph: request-reply throughput (Mops) vs. number of client/server processes (1 to 16). Setup: N server processes, clients C1…CN; throughput degrades as the number of processes grows.]
#3: Improve scalability
[Figure: clients C1…CN each require per-connection state in the RNIC; with many clients, the total connection state exceeds the RNIC's SRAM (||state|| > SRAM).]
#3: Improve scalability
Inbound scalability ≫ outbound scalability, because per-connection inbound state is much smaller than outbound state and fits in SRAM.
→ Use datagram transport for outbound replies.
Datagram only supports SEND/RECV, and SEND/RECV is slow; however, it is slow only at the receiver, so the server can still use datagram SENDs for replies (sketched below).
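A minimal sketch of a datagram reply, assuming an unreliable datagram (UD) queue pair (IBV_QPT_UD) and an address handle for the destination client created elsewhere; the function and parameter names are placeholders, not HERD's code.

    /* Sketch: reply to a client with a SEND over a single, connectionless UD QP. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    static int post_ud_reply(struct ibv_qp *ud_qp, struct ibv_ah *client_ah,
                             uint32_t client_qpn, uint32_t client_qkey,
                             void *reply, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) reply,
            .length = len,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_SEND;      /* two-sided SEND over datagram transport */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_INLINE;  /* small replies can be inlined as well */
        /* One UD QP can reach any client: the destination is named per-SEND via
         * an address handle, so the server keeps no per-client connection state. */
        wr.wr.ud.ah          = client_ah;
        wr.wr.ud.remote_qpn  = client_qpn;
        wr.wr.ud.remote_qkey = client_qkey;

        return ibv_post_send(ud_qp, &wr, &bad_wr);
    }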
Scalable request-reply
[Graph: throughput (Mops) vs. number of client/server processes (1 to 16). Naive request-reply (connected RDMA writes in both directions) degrades with more processes; hybrid request-reply (connected WRITE requests + datagram SEND replies) scales.]
Evaluation
HERD = Request-Reply + MICA [NSDI 2014]
Compare against emulated versions of Pilaf and FaRM-KV:
• No datastore
• Focus on the maximum achievable performance
Latency vs. throughput
48-byte items, GET-intensive workload
[Graph: 5th- and 95th-percentile latency (µs) vs. throughput (Mops) for emulated Pilaf, emulated FaRM-KV, and HERD. The emulated systems reach 12 Mops at 8 µs; HERD reaches 26 Mops at 5 µs, and 3.4 µs at low load.]
Throughput comparison
16-byte keys, 95% GET workload
[Graph: throughput (Mops) vs. value size (4 to 1024 bytes) for emulated Pilaf, emulated FaRM-KV, and HERD; HERD's throughput is 2X higher.]
HERD
• Re-designing RDMA-based KV stores to use a single round trip
• WRITEs outperform READs
• Reduce PCIe and InfiniBand transactions
• Embrace SEND/RECV
• Code is online: https://github.com/efficient/HERD
Throughput comparison
16-byte keys, 95% GET workload
[Graph: same comparison with a raw RDMA READ line added; HERD is faster than RDMA reads. Throughput (Mops) vs. value size (4 to 1024 bytes).]
Throughput comparison
48-byte items
[Graph: throughput (Mops) of emulated Pilaf, emulated FaRM-KV, and HERD under 5%, 50%, and 100% PUT workloads.]