Low-Latency Communication for Fast DBMS Using RDMA and Shared Memory

Philipp Fent, Alexander van Renen, Andreas Kipf, Viktor Leis∗, Thomas Neumann, Alfons Kemper
Technische Universität München, ∗Friedrich-Schiller-Universität Jena
April 23, 2020

Communication Performance
Figure: The DBMS and the client application (ODBC) each communicate through the same three layers:
• Data serialization format
• Network & transport protocol
• Physical interconnect

Communication Performance
Figure: TPC-C throughput using Silo [transactions / second]

                 In-process   Domain Sockets   TCP (Linux)
(a) 1 Thread     58 K         2.7 K            1.5 K
(b) 8 Threads    344 K        16 K             8.9 K

Understanding the Bottleneck
• Misconception: Network is slow
• Twofold actual bottleneck:
  • TCP
  • Through kernel communication
Figure: Kernel based communication. The client's write() and the DBMS's read() each cross into the kernel via a syscall, annotated with > 10 k cycles on each side.
Figure: Direct memory access. Client and DBMS memcpy() to and from a shared message buffer, bypassing the kernel, annotated with 1 × mmap() and ≈ 100 cycles on each side.
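The two figures can be contrasted with a minimal sketch. It assumes a single process standing in for both client and DBMS, and the function names via_kernel and via_shared_memory are hypothetical. Python's interpreter overhead is far larger than the cycle counts on the slide, so this only illustrates the control flow: one syscall pair per message versus one mapping followed by plain copies.

```python
import mmap
import os

MESSAGE = b"SELECT a FROM r WHERE x = 28"


def via_kernel(message: bytes) -> bytes:
    """Send through a pipe: one write() and one read() syscall per message."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, message)
        return os.read(read_fd, len(message))
    finally:
        os.close(read_fd)
        os.close(write_fd)


def via_shared_memory(message: bytes) -> bytes:
    """Map a buffer once, then copy into and out of it with no further syscalls."""
    buffer = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous mapping, created once
    try:
        buffer[:len(message)] = message      # "client" copies the message in
        return bytes(buffer[:len(message)])  # "DBMS" copies the message out
    finally:
        buffer.close()
```

In a real deployment the mapping would be shared between two processes and reused for every message, which is what makes its setup cost negligible.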

Low-Latency Communication Using Shared Memory
• Co-hosted on the same machine
• Latency similar to embedded DBs, e.g. SQLite
• Ideal interconnect for container / Docker environment
• Bootstrapped via Domain Sockets
• Pass message buffer via cmsg ancillary data
• Ring buffer with polling to transfer serialized data
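The bootstrapping step can be sketched as follows, assuming a Unix system. The slides do not say which side allocates the buffer; here the client creates it, which is an assumption, as are the names send_buffer_fd, recv_buffer_fd, demo, and BUFFER_SIZE. A socket pair in one process stands in for the domain socket between client and DBMS, and a temporary file stands in for the shared memory object.

```python
import array
import mmap
import os
import socket
import tempfile

BUFFER_SIZE = 4096  # hypothetical size of the message buffer


def send_buffer_fd(sock, fd):
    """Pass a file descriptor as SCM_RIGHTS ancillary data (cmsg)."""
    fds = array.array("i", [fd])
    sock.sendmsg([b"B"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_buffer_fd(sock):
    """Receive one file descriptor from the ancillary data of a message."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return fds[0]


def demo(message=b"SELECT a FROM r WHERE x = 28"):
    dbms_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(BUFFER_SIZE)
        client_view = mmap.mmap(backing.fileno(), BUFFER_SIZE)
        send_buffer_fd(client_sock, backing.fileno())

        received_fd = recv_buffer_fd(dbms_sock)
        dbms_view = mmap.mmap(received_fd, BUFFER_SIZE)
        os.close(received_fd)  # the mapping stays valid after the fd is closed

        client_view[:len(message)] = message       # client writes
        result = bytes(dbms_view[:len(message)])   # DBMS sees the same bytes

        client_view.close()
        dbms_view.close()
    dbms_sock.close()
    client_sock.close()
    return result
```

After this exchange the socket is no longer needed for data transfer; both sides communicate through the mapped buffer only.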

Low-Latency Communication Using Shared Memory
• Available bandwidth depends on transmission parameters
Figure: Heat map of shared memory bandwidth by size of transmission buffer [log] and size of transmitted chunks [log], both from 4 kB to 1 GB, with the L2 cache and L3 cache sizes marked. Measured values range from about 1.5 GB/s to 5.3 GB/s.
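The two transmission parameters in the heat map, buffer size and chunk size, can be made concrete with a minimal single-producer, single-consumer ring buffer. This is a sketch under assumptions, not the authors' implementation: the header layout, the class RingBuffer, and the function transfer are hypothetical, and producer and consumer alternate in one thread instead of polling from two processes.

```python
import mmap
import struct

# Hypothetical header: read position and write position as monotonic counters.
HEADER = struct.Struct("<QQ")


class RingBuffer:
    def __init__(self, memory, capacity):
        self.memory = memory
        self.capacity = capacity

    def try_write(self, chunk):
        """Copy a chunk in if it fits; return False when the buffer is too full."""
        read_pos, write_pos = HEADER.unpack_from(self.memory, 0)
        if len(chunk) > self.capacity - (write_pos - read_pos):
            return False
        start = write_pos % self.capacity
        first = min(len(chunk), self.capacity - start)
        base = HEADER.size
        self.memory[base + start:base + start + first] = chunk[:first]
        self.memory[base:base + len(chunk) - first] = chunk[first:]  # wrap around
        struct.pack_into("<Q", self.memory, 8, write_pos + len(chunk))
        return True

    def try_read(self):
        """Return all bytes currently available (possibly none)."""
        read_pos, write_pos = HEADER.unpack_from(self.memory, 0)
        available = write_pos - read_pos
        start = read_pos % self.capacity
        first = min(available, self.capacity - start)
        base = HEADER.size
        data = bytes(self.memory[base + start:base + start + first])
        data += bytes(self.memory[base:base + available - first])
        struct.pack_into("<Q", self.memory, 0, write_pos)
        return data


def transfer(payload, buffer_size, chunk_size):
    """Move payload through the ring in chunks; the reader drains when it is full."""
    if not 0 < chunk_size <= buffer_size:
        raise ValueError("chunk_size must be between 1 and buffer_size")
    memory = mmap.mmap(-1, HEADER.size + buffer_size)
    ring = RingBuffer(memory, buffer_size)
    received = bytearray()
    sent = 0
    while len(received) < len(payload):
        chunk = payload[sent:sent + chunk_size]
        if chunk and ring.try_write(chunk):
            sent += len(chunk)
        else:
            received += ring.try_read()
    memory.close()
    return bytes(received)
```

The sketch requires chunks no larger than the buffer. A real cross-process version would also need atomic, correctly ordered updates of the two positions, which this single-threaded sketch does not address.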

Low-Latency Communication Using RDMA
• Co-located in the same datacenter
• Bootstrapped via regular TCP/IP
• Similar ring buffer as with Shared Memory
• RDMA intricacies: + 70 %
Figure: Sync. throughput [msgs / s] (0 to 600 K) by size of message [Byte] (1 to 256, cache line marked). Labels in the plots: Send + Receive, Write + Immediate, Write + Polling, Two Writes, Chained Writes, Immediate Data.

Low-Latency Communication Using RDMA
• Asymmetric connections
• Many message buffers → random accesses for polling
• Cache efficient mailbox polling
• Two writes to separate memory regions
• Scales up to the limit of RDMA's reliable connections
Figure: Message buffer with in-flight messages ([30] SELECT e FROM r WHERE x = 81, [28] SELECT a FROM r WHERE x = 28) and a mailbox with flags (X), written after the message, for efficient polling.
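The mailbox idea can be illustrated with a small simulation. It makes several assumptions: remote RDMA writes are modeled as plain local stores, the 4-byte length prefix is hypothetical, and the class MailboxServer with its methods is invented for this sketch. The point is only the access pattern: the server scans one contiguous mailbox and touches a message buffer only when its flag is set.

```python
import struct

LENGTH_PREFIX = struct.Struct("<I")  # hypothetical 4-byte message length header


class MailboxServer:
    def __init__(self, num_clients, buffer_size):
        # One message buffer per client, scattered in memory.
        self.buffers = [bytearray(buffer_size) for _ in range(num_clients)]
        # One flag byte per client, stored contiguously for cheap polling.
        self.mailbox = bytearray(num_clients)

    def client_send(self, client_id, message):
        """Model a client's two writes: first the message, then the mailbox flag."""
        buffer = self.buffers[client_id]
        if LENGTH_PREFIX.size + len(message) > len(buffer):
            raise ValueError("message does not fit into the buffer")
        LENGTH_PREFIX.pack_into(buffer, 0, len(message))
        start = LENGTH_PREFIX.size
        buffer[start:start + len(message)] = message
        self.mailbox[client_id] = 1  # written after the message

    def poll(self):
        """Scan only the mailbox; read a message buffer only if its flag is set."""
        ready = []
        for client_id, flag in enumerate(self.mailbox):
            if flag:
                buffer = self.buffers[client_id]
                (length,) = LENGTH_PREFIX.unpack_from(buffer, 0)
                start = LENGTH_PREFIX.size
                ready.append((client_id, bytes(buffer[start:start + length])))
                self.mailbox[client_id] = 0
        return ready
```

With many idle clients, a poll pass reads one byte per client from a single contiguous region instead of one cache line per message buffer.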

Results

Remote YCSB-C [sync. tx/s]
              1G Eth   56G IB   RDMA
Silo + L5     15 K     27 K     302 K
DBMS X        3.7 K    —        —
MySQL         7.1 K    8.0 K    —
PostgreSQL    6.3 K    7.5 K    —
Value without a recoverable position: 3 1 K —

Local YCSB-C [sync. tx/s]
Columns: TCP, SHM, NP, DS, RDMA
Rows: Silo + L5, DBMS X, MySQL, SQLite
Values in extraction order (cell positions not recoverable): 378 K, 685 K, —, 364 K, 50.5 K, 72.1 K, 7.56 K, 11.5 K, 11.5 K, 10.0 K, 45.9 K, 27.6 K
