Fast QUIC sockets with vector packet processing


  1. Fast QUIC sockets with vector packet processing (Aloys Augustin, Nathan Skrzypczak, Mathias Raoul, Dave Wallace)

  2. What is QUIC?

  4. The stack: HTTP/2 runs over TLS over TCP, HTTP/3 runs over QUIC over UDP, both over IP

  5. Nice properties
     ● Encryption by default (~ TLS 1.3 handshake)
     ● No ossification
     ● Built-in multiplexing
       ○ Very common application requirement
       ○ Independent streams in each connection
       ○ Addresses head-of-line blocking
       ○ Stream prioritization support
     ● Supports mobility
       ○ The 5-tuple may change without breaking the connection

  6. Connections & streams (diagram): a client and a server sharing two connections, each connection carrying two independent streams

  7. Why QUIC: pros & cons
     Pros
     ● Runs on UDP, can be implemented outside the kernel
     ● Addresses head-of-line blocking
     ● 5-tuple mobility
     ● Encryption by default
     Cons
     ● Implementation complexity
     ● No standard northbound API (for now)
     ● Still evolving relatively fast, not an IETF standard yet

  8. A quick dive into the code

  9. Building blocks (diagram): client app → socket API → QUIC implementation → L4 / UDP → wire

  10. Building blocks: VPP (vectorization, fast L2-3-4, pluggable sessions) as the dataplane, the VPP client library for applications, and quicly (few assumptions, very modular) as the QUIC implementation (https://github.com/h2o/quicly)

  11. What is VPP?
     ● Fast userspace networking dataplane: https://fd.io/
     ● Open source: https://gerrit.fd.io/r/q/project:vpp
     ● Extensible through plugins
     ● Multi-architecture (x86, ARM, ...), runs on bare metal / in VMs / in containers
     ● Highly optimized for performance (vector instructions, cache efficiency, DPDK, native crypto, native drivers)
     ● Feature-rich L2-3-4 networking (switching, routing, IPsec, ...)
     ● Includes a host stack with L4 protocols (TCP, UDP)
     → A great platform for a fast userspace QUIC stack

  12. VPP host stack (1/2)
     ● Generic session layer exposing L4 protocols
       ○ Socket-like APIs
     ● Fifos used to pass data between apps and protocols
     ● Internal API for VPP plugins
     ● Similar external API for independent processes, available through a message queue
     ● Designed for high performance
       ○ Saturates a 40G link with 1 TCP flow or 1 UDP flow
       ○ Performance scales linearly with the number of threads

  13. VPP host stack (2/2) (diagram): internal and external applications exchange data with the TCP/UDP plugin (which sits above L2/L3 inside vpp) through per-session rx/tx fifos; control events travel over a message queue, and external applications attach via VCL

  14. QUIC app requirements. Three types of objects: Listeners, Connections, Streams (diagram: a server-side Listener #1, two Connections between client and server, each carrying Stream #1 and Stream #2)

  15. Socket-like API for QUIC sockets (diagram: listen(8000) creates Listener :8000; the client's connect(server, 8000) and the server's accept(:8000) create connection sockets c1 and a1; connect(c1) / accept(a1) then create stream sockets c1-1, c1-2 / a1-1, a1-2). Three types of sockets, for listeners, connections and streams. Connection sockets are only used to connect and accept streams; they cannot send or receive data.
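
To make the three-level socket model concrete, here is a minimal client-side sketch in C. The quic_connect / quic_stream_connect / quic_send names are hypothetical stand-ins for the conceptual calls shown on the slide (in VPP they map onto the VCL or internal session APIs), so read this as pseudocode illustrating the call order rather than real function signatures.

```c
/* Hypothetical API mirroring the slide's listen/connect/accept flow. */
typedef int quic_sock_t;   /* handle to a listener, connection or stream socket */

quic_sock_t quic_connect (const char *host, int port);  /* open a connection socket   */
quic_sock_t quic_stream_connect (quic_sock_t conn);     /* open a stream on a connection */
int quic_send (quic_sock_t stream, const void *buf, int len);

static void
client_example (void)
{
  /* 1. Connection socket: performs the QUIC handshake, carries no data. */
  quic_sock_t c1 = quic_connect ("server", 8000);

  /* 2. Stream sockets: opened by "connecting" against the connection socket. */
  quic_sock_t c1_1 = quic_stream_connect (c1);
  quic_sock_t c1_2 = quic_stream_connect (c1);

  /* 3. Data is sent and received only on stream sockets. */
  quic_send (c1_1, "hello on stream 1", 17);
  quic_send (c1_2, "hello on stream 2", 17);
}
```

The server side mirrors this: accepting on the listener yields a connection socket, and accepting on that connection socket yields stream sockets.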

  16. Building a QUIC stack in VPP (diagram, before QUIC): the application exchanges data with a TCP/UDP session through rx/tx fifos via VCL, with control events on a message queue; TCP/UDP sits above L2/L3 inside vpp

  17. Building a QUIC stack in VPP (diagram, with QUIC): the application now exchanges data with a QUIC stream session via VCL, the QUIC plugin consumes a UDP session underneath, each layer with its own rx/tx fifos; UDP sits above L2/L3 inside vpp

  18. Zooming in on QUIC: the plugin embeds quicly and picotls and drives them through callbacks; a northbound interface exposes QUIC streams to the VPP session layer, and a southbound interface to the VPP session layer allows quicly to use the VPP UDP stack
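
The callbacks in this picture are quicly's usual integration points. Below is a trimmed sketch of the northbound side, assuming quicly's stream_open callback and the quicly_stream_callbacks_t structure (send-side callbacks, error handling and out-of-order reassembly are omitted); session_fifo_enqueue() is a hypothetical helper standing in for the copy into the VPP stream session's rx fifo.

```c
#include <quicly.h>

/* Hypothetical helper: copy decrypted stream data into the matching VPP
 * stream session rx fifo and notify the application. */
extern void session_fifo_enqueue (quicly_stream_t *stream, const void *src, size_t len);

/* Northbound: quicly delivers decrypted stream data here (out-of-order
 * handling is ignored in this sketch). */
static void
quic_on_receive (quicly_stream_t *stream, size_t off, const void *src, size_t len)
{
  session_fifo_enqueue (stream, src, len);
}

static void
quic_on_destroy (quicly_stream_t *stream, int err)
{
  /* Tear down the matching VPP stream session here. */
}

/* Send-side callbacks (on_send_shift / on_send_emit / on_send_stop) omitted. */
static const quicly_stream_callbacks_t quic_stream_callbacks = {
  .on_destroy = quic_on_destroy,
  .on_receive = quic_on_receive,
};

/* Called by quicly whenever a stream is opened: allocate a VPP stream
 * session and wire up the callbacks. */
static int
quic_on_stream_open (quicly_stream_open_t *self, quicly_stream_t *stream)
{
  stream->callbacks = &quic_stream_callbacks;
  return 0;
}

static quicly_stream_open_t stream_open = { .cb = quic_on_stream_open };
/* Elsewhere, when building the quicly context: ctx.stream_open = &stream_open; */
```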

  19. QUIC consumption models in VPP
     ● The VPP QUIC stack offers 3 consumption models:
       ○ External apps: independent processes, using the external (socket) API over the message queue
       ○ Internal apps: shared libraries loaded by VPP, using the internal API (the host stack's northbound interface)
       ○ Apps can also use the quicly northbound API directly, as long as they follow the VPP threading model
     ● The northbound interface in the host stack is optional!

  20. RX path (diagram): the VPP UDP rx node copies the buffer into the UDP session rx fifo and raises a session event; the QUIC plugin matches the packet to its connection and hands it to quicly for initial packet decoding and decryption; a quicly callback then copies the decrypted payload into the QUIC stream session rx fifo, and the stream data becomes available to the app (via a session event for internal apps, via an MQ event/notification for external apps)
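
The per-datagram half of this path follows quicly's standard receive pattern: split each UDP datagram into QUIC packets, then feed every packet to the connection it belongs to. A sketch, assuming quicly_decode_packet() and quicly_receive() with signatures close to recent quicly releases (they have changed over time, so quicly.h is authoritative); quic_find_conn_for_packet() is a hypothetical stand-in for the connection matching step shown on the slide.

```c
#include <stdint.h>
#include <quicly.h>

extern quicly_context_t quic_ctx;

/* Hypothetical helper: map a decoded packet (by connection ID) to the quicly
 * connection tracked by the QUIC plugin, or NULL if it is unknown. */
extern quicly_conn_t *quic_find_conn_for_packet (quicly_decoded_packet_t *pkt);

/* Process one UDP datagram taken from the UDP session rx fifo. */
static void
quic_rx_datagram (const uint8_t *data, size_t len,
                  struct sockaddr *src, struct sockaddr *dst)
{
  size_t off = 0;

  while (off < len)
    {
      quicly_decoded_packet_t pkt;

      /* Split the (possibly coalesced) datagram into individual packets. */
      if (quicly_decode_packet (&quic_ctx, &pkt, data, len, &off) == SIZE_MAX)
        break;   /* not a parseable QUIC packet */

      /* Connection matching, then decryption and protocol processing inside
       * quicly; decrypted stream data comes back through the on_receive
       * callback shown in the earlier sketch. */
      quicly_conn_t *conn = quic_find_conn_for_packet (&pkt);
      if (conn != NULL)
        quicly_receive (conn, dst, src, &pkt);
      /* else: a first packet for a new connection would go to quicly_accept() */
    }
}
```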

  21. Memory management and ACKs
     ● VPP fifos are fixed size. What if a sender sends more data than the fifo can hold?
       ○ Before a packet is decrypted, we have no way to know which stream(s) it carries data for, so we cannot check the available space in the receiving fifo
       ○ Once a packet is decrypted, quicly does not allow us to drop it: it would never be retransmitted
       ○ Fortunately, QUIC has a connection parameter called max_stream_data, which limits the in-flight (un-ACKed) data the peer may send per stream
       ○ Setting this parameter to the fifo size solves the problem, as long as we ACK data only when it is removed from the fifo (see the sketch below)
     ● QUIC has several other connection-level settings to control memory usage:
       ○ Maximum number of streams
       ○ Total un-ACKed data for the connection
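
To illustrate tying max_stream_data to the fifo size, here is a minimal sketch of how the relevant quicly transport parameters could be set when building the quicly context. It assumes quicly's quicly_context_t exposes transport_params.max_stream_data with bidi_local / bidi_remote / uni fields (names may differ between quicly versions); the fifo size and other limits below are placeholder values, not the ones used by the VPP plugin.

```c
#include <quicly.h>

/* Placeholder: rx fifo size chosen when the stream session is allocated. */
#define QUIC_FIFO_SIZE (64 << 10)   /* 64 KB per stream rx fifo */

static quicly_context_t quic_ctx;

static void
quic_ctx_init (void)
{
  /* Start from quicly's reference context, then clamp flow control. */
  quic_ctx = quicly_spec_context;

  /* Limit per-stream un-ACKed data to the fifo size, so the peer can never
   * have more data in flight than the receiving fifo can absorb. */
  quic_ctx.transport_params.max_stream_data.bidi_local = QUIC_FIFO_SIZE;
  quic_ctx.transport_params.max_stream_data.bidi_remote = QUIC_FIFO_SIZE;
  quic_ctx.transport_params.max_stream_data.uni = QUIC_FIFO_SIZE;

  /* Connection-level knobs mentioned on the slide (example values). */
  quic_ctx.transport_params.max_data = 16 * QUIC_FIFO_SIZE;  /* total un-ACKed data */
  quic_ctx.transport_params.max_streams_bidi = 16;           /* max number of streams */
}
```

Remember that this only works if data is ACKed when it leaves the fifo, not when it is enqueued.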

  22. TX path (diagram): the app pushes stream data into the QUIC stream session tx fifo (internal apps via a session event, external apps via an MQ event); the QUIC plugin feeds the payload to quicly for packet generation and encryption, copies the resulting UDP payload into the UDP session tx fifo and raises a UDP session event, and the VPP session node transmits it

  23. Backpressure
     ● UDP backpressure: we limit the number of packets generated by quicly so as not to overflow the UDP tx fifo
     ● How does an app know it should wait before sending more data?
       ○ When quicly cannot send data as fast as the app provides it, it stops reading from the QUIC stream tx fifos
       ○ The app needs to check the amount of space available in the fifo before sending data
       ○ The app can subscribe to notifications when data is dequeued from its fifos (see the sketch below)
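
A sketch of the fifo-space check for an internal app is below. It assumes VPP's svm fifo helpers svm_fifo_max_enqueue_prod(), svm_fifo_enqueue() and svm_fifo_add_want_deq_ntf() with the SVM_FIFO_WANT_DEQ_NOTIF flag; these names are taken from memory of the VPP source tree and may differ between releases, so treat the exact calls as assumptions rather than a verbatim recipe.

```c
#include <svm/svm_fifo.h>

/* Try to send `len` bytes of stream data on a QUIC stream session.
 * Returns the number of bytes actually enqueued (possibly 0). */
static u32
quic_stream_try_send (svm_fifo_t *tx_fifo, u8 *data, u32 len)
{
  /* How much room does the producer side of the tx fifo still have? */
  u32 space = svm_fifo_max_enqueue_prod (tx_fifo);

  if (space == 0)
    {
      /* Fifo full: ask to be notified when the QUIC layer dequeues data,
       * then back off instead of busy-polling. */
      svm_fifo_add_want_deq_ntf (tx_fifo, SVM_FIFO_WANT_DEQ_NOTIF);
      return 0;
    }

  u32 to_send = len < space ? len : space;
  svm_fifo_enqueue (tx_fifo, to_send, data);
  return to_send;
}
```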

  24. Threading model
     ● VPP runs either with one thread, or with one main thread + n worker threads
     ● UDP packet assignment to threads depends on RSS
       ○ The receiving thread is unknown when the first packet is sent
       ○ UDP connections start on one thread and migrate when the first reply is received
       ○ The VPP host stack sends notifications when this happens
     ● QUIC sessions are opened only once the handshake completes, and thus do not migrate (as long as there are no mobility events, which are not yet supported)
     ● All QUIC streams are placed on the thread where their connection lives

  25. How quick is it?

  26. Performance: evaluation
     For now there is no canonical QUIC performance assessment tool, so we use a custom iperf-like client/server benchmark tool:
     ● The client opens N connections
     ● It then opens n streams in each connection
     ● The client sends d bytes of data per stream
     ● The server closes the streams, then the connection
     Typical setup: N=10, n=10, d=1 GB
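
A rough sketch of the client side of such a benchmark, reusing the hypothetical quic_connect / quic_stream_connect / quic_send helpers from the socket-API sketch earlier. It only illustrates the N connections x n streams x d bytes pattern (sequentially, for brevity), not the actual test tool shipped with VPP.

```c
#include <string.h>

typedef int quic_sock_t;   /* same illustrative handle type as before */
extern quic_sock_t quic_connect (const char *host, int port);
extern quic_sock_t quic_stream_connect (quic_sock_t conn);
extern int quic_send (quic_sock_t stream, const void *buf, int len);

#define N_CONNECTIONS 10
#define N_STREAMS 10
#define BYTES_PER_STREAM (1ULL << 30)   /* d = 1 GB */
#define CHUNK (64 * 1024)

static void
run_client (const char *server)
{
  static char chunk[CHUNK];
  memset (chunk, 'x', sizeof (chunk));

  for (int c = 0; c < N_CONNECTIONS; c++)
    {
      quic_sock_t conn = quic_connect (server, 8000);
      for (int s = 0; s < N_STREAMS; s++)
        {
          quic_sock_t stream = quic_stream_connect (conn);
          for (unsigned long long sent = 0; sent < BYTES_PER_STREAM; sent += CHUNK)
            quic_send (stream, chunk, CHUNK);
          /* The server closes the stream after receiving d bytes,
           * and closes the connection once all its streams are done. */
        }
    }
}
```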

  27. Performance: test setup (diagram): a quicly test client and a quicly test server, each running on vpp with the avf driver and an XL710 NIC, connected by a 40 Gbps link
     ● Core pinning, VPP and the test apps on the same NUMA node
     ● 1500-byte MTU
     ● 2x Intel Xeon Gold 6146 3.2 GHz CPUs

  28. Performance: initial results
     ● 10x10 simultaneous connections, 1 worker: 3.5 Gbit/s
     ● 100x10 simultaneous connections, 4 workers: 13.7 Gbit/s
     ● Scales up to 100k streams / core
     ● Handshake rate ~1500 / s / core

  29. Performance: optimisations
     ● Crypto
       ○ quicly uses picotls by default for the TLS handshake and for packet encryption / decryption
       ○ picotls has a pluggable encryption API, which uses OpenSSL by default
       ○ Using the VPP native crypto API yielded better results
       ○ Further improvements were obtained by batching crypto operations, using the quicly offload API (see the sketch below):
         ■ N packets are received and decoded
         ■ These N packets are decrypted at the same time
         ■ The decrypted packets are passed to quicly for protocol processing
         ■ The same idea is applied on the TX path as well
     ● Congestion control
       ○ quicly's default congestion control (Reno) does not reach very high throughputs
       ○ Fortunately, it is pluggable as well :)
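
The crypto batching can be sketched as the loop below. The quic_batch_* helpers are hypothetical stand-ins for the offload hooks described on the slide; the point is the shape of the loop (decode N, decrypt N together, then do protocol processing), not the exact quicly offload API.

```c
#define RX_BATCH 32

typedef struct
{
  unsigned char *data;   /* decoded, still-encrypted QUIC packet */
  unsigned int len;
} quic_rx_pkt_t;

/* Hypothetical batch helpers standing in for the offload hooks. */
extern int quic_batch_receive_and_decode (quic_rx_pkt_t *pkts, int max);
extern void quic_batch_decrypt (quic_rx_pkt_t *pkts, int n);
extern void quic_batch_process (quic_rx_pkt_t *pkt);

static void
quic_rx_batch (void)
{
  quic_rx_pkt_t pkts[RX_BATCH];

  /* 1. Receive and decode up to N packets (header parsing only). */
  int n = quic_batch_receive_and_decode (pkts, RX_BATCH);

  /* 2. Decrypt all N packets at once, so the crypto backend (here the VPP
   *    native crypto API) can batch the AES operations. */
  quic_batch_decrypt (pkts, n);

  /* 3. Hand the already-decrypted packets to quicly for protocol processing.
   *    The TX path mirrors this: generate N packets, then encrypt them all
   *    before enqueueing to the UDP session. */
  for (int i = 0; i < n; i++)
    quic_batch_process (&pkts[i]);
}
```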

  30. Performance: new results
     ● 10x10, pre-optimization, 1 worker: 3.5 Gbit/s
     ● 10x10, with batching & native crypto, 1 worker: 4.5 Gbit/s (+28%)
     For now, most of the CPU time is spent doing crypto. Intel Ice Lake CPUs will accelerate AES and may move the bottleneck more towards protocol processing.

  31. What's next

  32. Next steps
     ● Performance optimisation
     ● Mobility support
     ● Continuous benchmarking, coming soon to https://docs.fd.io/csit/master/trending/index.html
     If you want to get involved: https://gerrit.fd.io/r/q/project:vpp (code in src/plugins/quic/)
     If you want to try it, check out the example code in src/plugins/hs_apps/ (host stack apps)

  33. Use cases
     ● Scalable HTTP/3 servers
     ● Scalable gRPC-over-QUIC servers
     ● QUIC VPN
       ○ Better than an SSL VPN: mobility support, and using one stream per flow removes head-of-line blocking
       ○ As easy to deploy as an SSL VPN: only a certificate is needed on the server, plus an authentication mechanism for clients
     ● QUIC VPN with transparent proxying
       ○ Transparently terminating TCP connections at the VPN gateway and sending only the TCP payloads over QUIC streams avoids the nested congestion control problem
