TEA: Enabling State-Intensive Network Functions on Programmable Switches
Daehyeok Kim §‡, Zaoxing Liu §, Yibo Zhu ^, Changhoon Kim †, Jeongkeun Lee †, Vyas Sekar §, Srinivasan Seshan §
§ Carnegie Mellon University   ‡ Microsoft Research   † Intel, Barefoot Switch Division   ^ ByteDance Inc.
Network functions in the network
Network functions (NFs) are an essential component of the network
• E.g., load balancer, firewall, network address translator (NAT), …
NF performance and scalability are key challenges
Approaches to deploying network functions
• Standalone hardware (fixed-function appliance): performance O(10 Gbps), memory O(10 GB) DRAM, price >$40K
• Server-based software (NFV) on a commodity server (programmable): performance O(10 Gbps), memory O(10 GB) DRAM, price ~$3K
• Switch-based NF on a programmable switch ASIC (programmable): performance O(1 Tbps), memory O(10 MB) on-chip SRAM, price ~$10K
Problem: serving demanding workloads
None of the options can efficiently serve demanding workloads!
• Millions of concurrent flows (O(100 MB) of state) and high traffic rates (>1 Tbps)
• Standalone hardware and server-based software (NFV) are cost- and energy-inefficient at this scale
• Switch-based NFs are promising, but the switch cannot maintain the flow state on chip
Root cause: limited on-chip SRAM
• Limited on-chip SRAM space: O(10 MB)
• Infeasible to maintain large flow state within on-chip SRAM
  • E.g., load-balancer state for 10M flows requires ≈100 MB
• Adding more SRAM to the ASIC would be too expensive
Can we leverage larger and cheaper DRAM near switch ASICs?
DRAM available on a switch board
• Option #1: DRAM on the switch control plane, reached from the switch ASIC over PCIe
• Option #2: On-board, off-chip DRAM attached to the switch ASIC as an extension
Drawbacks: limited scalability in terms of size and access bandwidth, and high cost
Opportunity: DRAM on commodity servers
Use DRAM on servers directly attached to the switch data plane (ASIC)
+ Scalable memory size and bandwidth
+ Low cost
Table Extension Architecture (TEA)
• Virtual table abstraction (key → value) for state-intensive NFs, backed by external DRAM on servers
• Table lookups with low and predictable latency and scalable throughput
• APIs that allow easy integration with NFs
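As a rough illustration of the virtual table abstraction, the Python sketch below shows the kind of key-value interface an NF could program against, with entries living in server DRAM rather than in on-chip SRAM. The class and method names and the dict-backed store are hypothetical stand-ins, not TEA's actual P4 API.

```python
class VirtualTable:
    """Hypothetical sketch of a key-value 'virtual table' whose entries live in server DRAM."""

    def __init__(self, remote_dram: dict):
        self.remote_dram = remote_dram          # stands in for DRAM reached via the switch

    def lookup(self, five_tuple: bytes):
        # An NF (e.g., a NAT or load balancer) looks up per-flow state by 5-tuple;
        # TEA resolves the entry from external DRAM transparently to the NF.
        return self.remote_dram.get(five_tuple)

    def insert(self, five_tuple: bytes, state) -> None:
        self.remote_dram[five_tuple] = state
```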
Outline
• Motivation
• TEA design
• Results
TEA design overview
A developer writes an NF implementation in P4 against the TEA API; the P4 compiler produces a binary that runs on the switch data plane (ASIC) and uses DRAM on servers attached to the switch.
Challenge 1: Enabling external DRAM access from the switch ASIC
Strawman: accessing external DRAM via the switch control plane (CPU)
- High and unpredictable latency
- Limited access bandwidth
How can the switch ASIC directly access external DRAM, without CPUs and without hardware modifications?
Challenge 2: Enabling single round-trip table lookups
Table entries (key–value pairs, keyed by the flow 5-tuple) live in server DRAM.
How can a correct table entry be retrieved in a single memory access?
Challenge 3: Deferred packet processing
How can packet processing be deferred while a lookup is in flight, without stalling the pipeline or buffering packets at the switch?
Challenge 4: Scaling TEA with multiple servers
With a single server, scalability and availability are limited:
• A single server can fail
• Access bandwidth is limited to a single link
How do we deal with changes in server availability, and how do we balance access load across servers?
Challenge 1: How to access external DRAM in the data plane?
Switch ASICs have no native capability to access external DRAM.
Is it possible to let the ASIC access external DRAM without hardware modifications and without CPU involvement?
Enabling RDMA in the switch data plane
Key idea: crafting RDMA packets using the ASIC's programmability
1. A packet comes into the pipeline
2. The ASIC adds RDMA headers to craft an RDMA request and sends it to the server
3. The server's RDMA NIC replies as it would to any standard RDMA request
4. The ASIC parses the response
Simple switch-side flow control prevents buffer overflows at the NIC, so a simplified transport is enough.
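To make step 2 concrete, the sketch below packs the RoCEv2 Base Transport Header (BTH) and RDMA Extended Transport Header (RETH) of a one-sided RDMA READ request. It is an illustrative Python model of the packet format only; in TEA the equivalent header construction happens in P4 on the ASIC, and the queue-pair number, PSN, rkey, and remote address shown here are placeholder values.

```python
import struct

RDMA_READ_REQUEST = 0x0C   # opcode for a Reliable Connection RDMA READ request

def build_bth(dest_qp: int, psn: int, pkey: int = 0xFFFF, ack_req: bool = True) -> bytes:
    """Base Transport Header (BTH), 12 bytes."""
    se_m_pad_tver = 0x00                                   # SE/MigReq/PadCnt/TVer bits, zeroed for simplicity
    word1 = struct.pack("!BBH", RDMA_READ_REQUEST, se_m_pad_tver, pkey)
    word2 = struct.pack("!I", dest_qp & 0xFFFFFF)          # 1 reserved byte + 24-bit destination QP
    word3 = struct.pack("!I", (int(ack_req) << 31) | (psn & 0xFFFFFF))
    return word1 + word2 + word3

def build_reth(remote_addr: int, rkey: int, length: int) -> bytes:
    """RDMA Extended Transport Header (RETH), 16 bytes: virtual address, rkey, DMA length."""
    return struct.pack("!QII", remote_addr, rkey, length)

def build_read_request(dest_qp: int, psn: int, remote_addr: int, rkey: int, length: int) -> bytes:
    """BTH + RETH of an RDMA READ request (carried over UDP dst port 4791 in RoCEv2).
    The Ethernet/IP/UDP encapsulation and the trailing ICRC are omitted here."""
    return build_bth(dest_qp, psn) + build_reth(remote_addr, rkey, length)

# Example with placeholder values: ask the NIC to read one 64-byte region of server DRAM.
req = build_read_request(dest_qp=0x11, psn=0, remote_addr=0x7F0000001000, rkey=0x1234, length=64)
assert len(req) == 28   # 12-byte BTH + 16-byte RETH
```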
Challenge 2: Single round-trip table lookups
An RDMA read takes ~2 μs, so multiple memory accesses per lookup would make table lookup latency high and unpredictable.
Can we complete an external table lookup in a single round trip, i.e., with an O(1) lookup mechanism?
Cuckoo hashing as a potential approach
Each key x has two candidate buckets, hash1(x) and hash2(x), which generally lie in non-contiguous memory locations, so a lookup may require two separate memory accesses.
Can we enable table lookups with a single memory access?
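A minimal Python sketch of a standard two-choice cuckoo lookup illustrates the problem: the two candidate buckets come from independent hash functions, so in the worst case a lookup touches two far-apart locations, i.e., two separate RDMA reads. The bucket count, hash construction, and entry layout here are illustrative assumptions.

```python
import hashlib

NUM_BUCKETS = 1024   # illustrative

def _bucket(key: bytes, seed: int) -> int:
    # Two independent hash functions derived from different seeds.
    return int.from_bytes(hashlib.sha256(bytes([seed]) + key).digest()[:4], "big") % NUM_BUCKETS

def candidate_buckets(key: bytes) -> tuple[int, int]:
    return _bucket(key, 1), _bucket(key, 2)

def cuckoo_lookup(table: list, key: bytes):
    b1, b2 = candidate_buckets(key)
    if key in table[b1]:          # first memory access
        return table[b1][key]
    if key in table[b2]:          # possibly a second access at an unrelated address
        return table[b2][key]
    return None

# Example: the two candidate buckets are generally far apart in memory.
table = [dict() for _ in range(NUM_BUCKETS)]
flow = b"10.0.0.1:1234->10.0.0.2:80/tcp"
b1, b2 = candidate_buckets(flow)
table[b1][flow] = "server-3"
print(b1, b2, cuckoo_lookup(table, flow))
```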
TEA-table: lookup data structure
Key idea: repurposing bounded linear probing (BLP), a data structure originally designed to improve cache hit rates in a software switch [Zhou'19*]
• A key x is stored in bucket hash(x) or in the immediately following bucket, trading some space efficiency for contiguous allocation
• Because the two candidate buckets are adjacent, both can be fetched with a single RDMA read
*Dong Zhou. Data Structure Engineering for High Performance Software Packet Processing. Ph.D. Dissertation. Carnegie Mellon University, 2019.
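A small Python sketch of the bounded-linear-probing idea behind the TEA-table: a key may live only in bucket h(x) or bucket (h(x)+1) mod N, so the two candidate buckets are always adjacent and can be fetched with one contiguous read. The bucket size, slot count, and hash function are illustrative choices, not TEA's exact parameters.

```python
import hashlib

NUM_BUCKETS = 1024        # illustrative
SLOTS_PER_BUCKET = 8      # illustrative

def bucket_index(key: bytes) -> int:
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_BUCKETS

class TeaTableSketch:
    """Bounded linear probing: each key goes into bucket h(x) or the next bucket."""

    def __init__(self):
        # Each bucket is a list of (key, value) slots; buckets are laid out contiguously,
        # mirroring the contiguous DRAM layout that makes one RDMA read sufficient.
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def _candidates(self, key: bytes):
        b = bucket_index(key)
        return b, (b + 1) % NUM_BUCKETS

    def insert(self, key: bytes, value) -> bool:
        for b in self._candidates(key):
            if len(self.buckets[b]) < SLOTS_PER_BUCKET:
                self.buckets[b].append((key, value))
                return True
        return False          # both candidate buckets full: caller must resize or evict

    def lookup(self, key: bytes):
        b1, b2 = self._candidates(key)
        # Emulates one RDMA read covering two adjacent buckets (2 x bucket size bytes).
        fetched = self.buckets[b1] + self.buckets[b2]
        for k, v in fetched:
            if k == key:
                return v
        return None
```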
Challenge 3: Deferred packet processing
Packet processing must be deferred until the external lookup completes.
Can we defer only the packets that need an external lookup, without stalling the pipeline?
Offloading packet store to the TEA-table for asynchronous packet processing
Idea #1: employ a scratchpad in the TEA-table (in server DRAM) to buffer packets while their lookups are in flight
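The Python sketch below models the scratchpad idea only at a conceptual level: the packet that triggered a lookup is stashed in a scratchpad region of the TEA-table in server DRAM instead of switch memory, and it comes back with the lookup response, at which point processing resumes. How the store and the lookup are actually combined into RDMA operations, and the scratchpad layout, are simplifying assumptions here, not TEA's exact wire protocol.

```python
import hashlib

NUM_SCRATCHPADS = 1024   # illustrative

def scratchpad_slot(key: bytes) -> int:
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_SCRATCHPADS

class ScratchpadLookup:
    """Conceptual model: park in-flight packets in DRAM-side scratchpads, not on the switch."""

    def __init__(self, dram_table: dict):
        self.table = dram_table                       # key -> value entries in server DRAM
        self.scratchpads = [None] * NUM_SCRATCHPADS   # packet buffer co-located with the table

    def issue_lookup(self, key: bytes, packet: bytes) -> int:
        # Instead of holding the packet in the pipeline, the switch ships it to a
        # scratchpad slot in server DRAM together with the lookup request.
        slot = scratchpad_slot(key)
        self.scratchpads[slot] = packet
        return slot

    def on_response(self, key: bytes, slot: int):
        # The response brings back the table entry and the stashed packet,
        # so processing of the deferred packet resumes only now.
        value = self.table.get(key)
        packet, self.scratchpads[slot] = self.scratchpads[slot], None
        return value, packet
```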