Low-Latency Communication for Fast DBMS Using RDMA and Shared Memory
Philipp Fent, Alexander van Renen, Andreas Kipf, Viktor Leis∗, Thomas Neumann, Alfons Kemper
Technische Universität München
∗ Friedrich-Schiller-Universität Jena
April 23, 2020

Communication Performance
Figure: The DBMS and the client application each sit on the same communication stack (diagram label: ODBC):
• Data serialization format
• Network & transport protocol
• Physical interconnect

Communication Performance
Figure: TPC-C throughput using Silo [transactions / second]

                         (a) 1 Thread   (b) 8 Threads
In-process               58 K           344 K
Domain (Linux Sockets)   2.7 K          16 K
TCP (Linux Sockets)      1.5 K          8.9 K

Understanding the Bottleneck
• Misconception: Network is slow
• Twofold actual bottleneck:
  • TCP
  • Through kernel communication
Figure: Kernel based communication. The client calls write() and the DBMS calls read(); each side crosses into the kernel with a syscall that costs > 10 k cycles.
Figure: Direct memory access. Client and DBMS each map the message buffer once (1 × mmap()) and then exchange data with memcpy(), bypassing the kernel, at ≈ 100 cycles per side.

Low-Latency Communication Using Shared Memory
• Co-hosted on the same machine
• Latency similar to embedded DBs, e.g. SQLite
• Ideal interconnect for container / Docker environment
• Bootstrapped via Domain Sockets
• Pass message buffer via cmsg ancillary data
• Ring buffer with polling to transfer serialized data

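The bootstrap step listed above can be sketched in a few lines of Python (3.9 or newer, Unix only). This is an illustration, not the authors' implementation: `socket.socketpair` stands in for an accepted domain socket connection, `BUF_SIZE`, `make_shared_fd`, and `pass_buffer_demo` are hypothetical names, and both endpoints run in one process for brevity. `socket.send_fds` and `socket.recv_fds` transport the file descriptor as `SCM_RIGHTS` ancillary data.

```python
import mmap
import os
import socket
import tempfile

BUF_SIZE = 4096  # hypothetical size of the shared message buffer


def make_shared_fd(size):
    """Create an anonymous file that backs the shared message buffer."""
    if hasattr(os, "memfd_create"):      # Linux
        fd = os.memfd_create("msgbuf")
    else:                                # other Unix systems
        with tempfile.TemporaryFile() as tmp:
            fd = os.dup(tmp.fileno())
    os.ftruncate(fd, size)
    return fd


def pass_buffer_demo():
    # Stand-in for a domain socket connection between DBMS and client.
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    # Server side: create the buffer and send its descriptor as ancillary data.
    fd = make_shared_fd(BUF_SIZE)
    socket.send_fds(server, [b"B"], [fd])
    server_view = mmap.mmap(fd, BUF_SIZE)

    # Client side: receive the descriptor and map the same memory.
    _, fds, _, _ = socket.recv_fds(client, 1, 1)
    client_view = mmap.mmap(fds[0], BUF_SIZE)

    # From here on, a message is a plain memory copy on both sides.
    query = b"SELECT a FROM r WHERE x = 28"
    client_view[:len(query)] = query
    received = bytes(server_view[:len(query)])

    for view in (client_view, server_view):
        view.close()
    for descriptor in (fd, fds[0]):
        os.close(descriptor)
    server.close()
    client.close()
    return received
```

After the descriptor has been passed once, the socket is no longer needed on the data path. The sketch leaves out the ring buffer layout and the polling loop.
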
Low-Latency Communication Using Shared Memory
Available bandwidth depends on transmission parameters
Figure: Heatmap of bandwidth (color scale 2 GB/s to 5 GB/s, cell values between 1.5 and 5.3) over size of transmission buffer [log] and size of transmitted chunks [log], each ranging from 4 kB to 1 GB, with the L2 cache and L3 cache sizes marked.

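The two transmission parameters, buffer size and chunk size, can be made concrete with a small ring buffer sketch. It is a single-threaded Python illustration under stated assumptions: `RingBuffer` and `transfer` are hypothetical names, and producer and consumer steps are interleaved in one loop. A real implementation would place the buffer in shared memory and update the two counters with proper memory ordering from separate processes.

```python
class RingBuffer:
    """Single-producer, single-consumer byte ring buffer (illustrative)."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.produced = 0  # total bytes written, advanced by the producer only
        self.consumed = 0  # total bytes read, advanced by the consumer only

    def try_write(self, data):
        """Copy as much of data as currently fits; return the byte count."""
        free = self.capacity - (self.produced - self.consumed)
        n = min(len(data), free)
        pos = self.produced % self.capacity
        first = min(n, self.capacity - pos)      # part before the wrap-around
        self.buf[pos:pos + first] = data[:first]
        self.buf[0:n - first] = data[first:n]    # part after the wrap-around
        self.produced += n
        return n

    def try_read(self, max_bytes):
        """Return up to max_bytes of unread data (possibly empty)."""
        n = min(max_bytes, self.produced - self.consumed)
        pos = self.consumed % self.capacity
        first = min(n, self.capacity - pos)
        out = bytes(self.buf[pos:pos + first]) + bytes(self.buf[0:n - first])
        self.consumed += n
        return out


def transfer(message, buffer_size, chunk_size):
    """Move message through a ring of buffer_size bytes in chunk_size steps."""
    ring = RingBuffer(buffer_size)
    received = bytearray()
    sent = 0
    polls = 0
    while len(received) < len(message):
        if sent < len(message):
            # Producer step: push the next chunk, or the part of it that fits.
            sent += ring.try_write(message[sent:sent + chunk_size])
        # Consumer step: poll the ring and drain at most one chunk.
        received += ring.try_read(chunk_size)
        polls += 1
    return bytes(received), polls
```

For example, `transfer(message, 4096, 512)` moves a 10240-byte message in 20 polling rounds. With a chunk larger than the buffer, each round is capped by the buffer size instead.
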
Low-Latency Communication Using RDMA
• Co-located in the same datacenter
• Bootstrapped via regular TCP/IP
• Similar ring buffer as with Shared Memory
• RDMA intricacies:
Figures: Sync. throughput [msgs / s] (axis 0 to 600 K) over size of message [Byte] (1 to 256, cache line marked). Series: Send + Receive, Write + Immediate, Write + Polling; Two Writes, Chained Writes, Immediate Data. Annotation: + 70 %.

Low-Latency Communication Using RDMA
• Asymmetric connections
• Many message buffers → random accesses for polling
• Cache efficient mailbox polling
• Two writes to separate memory regions
• Scales up to the limit of RDMA’s reliable connections
Figure: Message Buffer holding "[30] SELECT e FROM r WHERE x = 81" and "[28] SELECT a FROM r WHERE x = 28"; Mailbox with flags (X X). Labels: "written after message", "in-flight", "efficient polling".

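The mailbox idea above can be illustrated with a plain in-process simulation in Python. No RDMA is involved: byte arrays stand in for the registered memory regions, and `MailboxServer`, `client_send`, and `poll` are hypothetical names. The one-byte length prefix is an assumption made for this sketch, not the wire format from the talk.

```python
class MailboxServer:
    """Simulates per-client message buffers plus one contiguous mailbox."""

    def __init__(self, num_clients, buf_size=64):
        # One message buffer per client: scattered memory regions.
        self.buffers = [bytearray(buf_size) for _ in range(num_clients)]
        # One flag byte per client, stored contiguously for cheap polling.
        self.mailbox = bytearray(num_clients)

    def client_send(self, client_id, payload):
        """Stand-in for the client's two remote writes."""
        buf = self.buffers[client_id]
        if len(payload) > min(255, len(buf) - 1):
            raise ValueError("payload too large for this sketch")
        # Write 1: length prefix and message into the client's buffer.
        buf[0] = len(payload)
        buf[1:1 + len(payload)] = payload
        # Write 2: the mailbox flag, written after the message.
        self.mailbox[client_id] = 1

    def poll(self):
        """Scan only the mailbox; touch a message buffer only if flagged."""
        ready = []
        for client_id, flag in enumerate(self.mailbox):
            if flag:
                buf = self.buffers[client_id]
                ready.append((client_id, bytes(buf[1:1 + buf[0]])))
                self.mailbox[client_id] = 0  # mark the slot as consumed
        return ready
```

The server's polling loop reads one small contiguous array instead of jumping between many message buffers, which is the cache behavior the slide refers to. Because the flag is written after the message, a set flag implies that the message is already in place.
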
Results

Remote YCSB-C [sync. tx/s]
             1 G Eth   56 G IB   RDMA
Silo + L5    15 K      27 K      302 K
DBMS X       3.7 K     —         —
MySQL        7.1 K     8.0 K     —
PostgreSQL   6.3 K     7.5 K     —

Local YCSB-C [sync. tx/s]
Columns: TCP, SHM, NP, DS, RDMA. Rows: Silo + L5, DBMS X, MySQL, SQLite.
Cell values (positions lost in extraction): 685 K, 364 K, 378 K, 50.5 K, 72.1 K, 7.56 K, 11.5 K, 11.5 K, 10.0 K, 45.9 K, 27.6 K; the remaining cells are "—".

Value without a recoverable position in either table: 3 1 K.