NetChain: Scale-Free Sub-RTT Coordination
Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica
Conventional wisdom: avoid coordination. NetChain: lightning-fast coordination enabled by programmable switches, opening the door to rethinking distributed systems design.
Coordination services: a fundamental building block of the cloud. Applications are built on top of a coordination service (e.g., Chubby).
Coordination services provide critical coordination functionality to applications: configuration management, group membership, distributed locking, and barriers.
The core of a coordination service is a strongly-consistent, fault-tolerant key-value store running on servers. That store is the focus of this talk.
Workflow of coordination services: a client sends a request to the coordination servers, which run a consensus protocol and send back a reply. Can we do better?
- Throughput: at most the server NIC throughput.
- Latency: at least one RTT, typically a few RTTs.
Opportunity: in-network coordination. Distributed coordination is communication-heavy, not computation-heavy, which makes it a natural fit for switches.

                      Switch (Barefoot Tofino)   Server ([NetBricks, OSDI'16])
  Packets per second  a few billion              30 million
  Bandwidth           6.5 Tbps                   10-100 Gbps
  Processing delay    < 1 µs                     10-100 µs
Opportunity: in-network coordination. The client sends its request to coordination switches running the consensus protocol and gets the reply directly from the data plane.
- Throughput: switch throughput.
- Latency: half of an RTT.
Design goals for coordination services:
- High throughput and low latency: directly from high-performance switches.
- Strong consistency and fault tolerance: how? Chain replication in the network.
What is chain replication?
- Storage nodes are organized in a chain structure: head (S0), replica (S1), tail (S2).
- Handling operations: reads are served by the tail; writes propagate from head to tail.
- Provides strong consistency and fault tolerance: tolerates f failures with f+1 nodes.
A minimal sketch of the two paths follows below.
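```python
# A minimal Python sketch of chain replication's read/write paths: reads are
# served by the tail, writes enter at the head and propagate hop by hop to
# the tail, which acknowledges. The Node class and names are illustrative,
# not NetChain's actual API.
class Node:
    def __init__(self):
        self.store = {}
        self.next = None              # successor in the chain (None at the tail)

    def write(self, key, value):
        self.store[key] = value       # apply locally, then forward downstream
        if self.next is not None:
            return self.next.write(key, value)
        return "ack"                  # the tail acknowledges the write

    def read(self, key):
        return self.store.get(key)    # correct only when called on the tail

head, replica, tail = Node(), Node(), Node()
head.next, replica.next = replica, tail
assert head.write("foo", "A") == "ack"
assert tail.read("foo") == "A"
```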
Division of labor in chain replication: a perfect match to the network architecture.
Chain replication:
- Storage nodes: optimized for high performance; handle read & write requests and provide strong consistency.
- Auxiliary master: handles less frequent reconfiguration and provides fault tolerance.
Network architecture:
- Network data plane: handles packets at line rate.
- Network control plane: handles network reconfiguration.
NetChain overview: the switches (S0-S5) handle read & write requests at line rate, while a network controller handles reconfigurations (e.g., switch failures); clients run on hosts in the racks.
How to build a strongly-consistent, fault-tolerant, in-network key-value store?
Data plane:
- Q1: How to store and serve key-value items?
- Q2: How to route queries according to the chain structure?
- Q3: How to handle out-of-order delivery in the network?
Control plane:
- Q4: How to handle switch failures?
PISA: Protocol Independent Switch Architecture.
- Programmable parser: converts packet data into metadata.
- Programmable match-action pipeline: operates on metadata and updates memory state (each stage has match units, ALUs, and memory).
PISA for NetChain:
- Programmable parser: parses the custom key-value fields in the packet.
- Programmable match-action pipeline: reads and updates key-value data at line rate. A pure-Python illustration of one stage follows below.
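```python
# A pure-Python illustration (not P4) of what a single match-action stage
# does for NetChain: match on the parsed key, then read or update a slot in
# the stage's register array -- the stateful memory a PISA ALU operates on
# at line rate. Table contents and array size are made up for the example.
register_array = [None] * 8                    # per-stage stateful memory
match_table = {"X": 0, "Y": 5, "Z": 2}         # match: key -> action: RA index

def match_action_stage(op, key, value=None):
    idx = match_table.get(key)
    if idx is None:
        return None                            # default action: Drop()
    if op == "write":
        register_array[idx] = value            # ALU updates memory state
    return register_array[idx]                 # value carried back in the packet

match_action_stage("write", "X", "A")
assert match_action_stage("read", "X") == "A"
```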
Inside a NetChain switch: the control plane (CPU) runs a NetChain agent that talks to the network controller and manages the data plane over a PCIe run-time API; the data plane (ASIC) runs the regular network functions and the key-value store on the programmable parser and match-action pipeline.
Q1 (data plane): How to store and serve key-value items?
NetChain packet format:

  ETH | IP | UDP | SC | S0 | S1 | ... | Sk | OP | SEQ | KEY | VALUE

- ETH/IP/UDP: existing protocols, normal L2/L3 routing; NetChain is invoked via a reserved UDP port.
- SC, S0...Sk: NetChain routing state, the remaining chain of switch IPs.
- OP: read, write, delete, etc.
- SEQ: sequence number, inserted by the head switch.
- An application-layer protocol: compatible with existing L2-L4 layers. A sketch of packing/parsing this payload follows below.
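```python
import socket
import struct

# A hedged sketch of building and parsing the NetChain payload that follows
# the UDP header. The field order (SC, S0..Sk, OP, SEQ, KEY, VALUE) comes
# from the slide; the field widths chosen here (8-bit SC and OP, 32-bit SEQ,
# 64-bit KEY) are assumptions for illustration only.
OPS = {"read": 0, "write": 1, "delete": 2}

def pack_netchain(chain_ips, op, seq, key, value=b""):
    buf = struct.pack("!B", len(chain_ips))          # SC: remaining chain hops
    for ip in chain_ips:                             # S0..Sk: one IPv4 per hop
        buf += socket.inet_aton(ip)
    buf += struct.pack("!BIQ", OPS[op], seq, key)    # OP, SEQ, KEY
    return buf + value                               # VALUE: <= 128 B in the prototype

def unpack_netchain(buf):
    sc = buf[0]
    chain = [socket.inet_ntoa(buf[1 + 4*i : 5 + 4*i]) for i in range(sc)]
    off = 1 + 4 * sc
    op, seq, key = struct.unpack_from("!BIQ", buf, off)
    return chain, op, seq, key, buf[off + 13:]

pkt = pack_netchain(["10.0.0.1", "10.0.0.2"], "write", seq=7, key=42, value=b"A")
assert unpack_netchain(pkt) == (["10.0.0.1", "10.0.0.2"], 1, 7, 42, b"A")
```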
In-network key-value storage.
- Key-value store in a single switch: store and serve key-value items using register arrays [NetCache, SOSP'17]. A match-action table maps each key to a register-array slot:

    Match      Action
    Key = X    Read/Write RA[0]
    Key = Y    Read/Write RA[5]
    Key = Z    Read/Write RA[2]
    Default    Drop()

- Key-value store across the network: data partitioning with consistent hashing and virtual nodes (a sketch follows below).
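```python
import hashlib
from bisect import bisect

# A minimal sketch of the partitioning scheme: consistent hashing with
# virtual nodes. Each switch is mapped to many points on a hash ring, and a
# key's chain is the next f+1 distinct switches clockwise from the key's
# hash. The hash function, virtual-node count, and chain length here are
# illustrative choices, not NetChain's exact parameters.
def _h(s):
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, switches, vnodes=100, chain_len=3):
        self.chain_len = chain_len
        self.points = sorted((_h(f"{sw}#{v}"), sw)
                             for sw in switches for v in range(vnodes))

    def chain(self, key):
        """Return the chain [head, ..., tail] of distinct switches for a key."""
        idx = bisect(self.points, (_h(key), ""))
        chain, seen = [], set()
        for i in range(len(self.points)):
            sw = self.points[(idx + i) % len(self.points)][1]
            if sw not in seen:
                seen.add(sw)
                chain.append(sw)
                if len(chain) == self.chain_len:
                    break
        return chain

ring = Ring(["S0", "S1", "S2", "S3", "S4", "S5"])
print(ring.chain("foo"))   # e.g. ['S3', 'S0', 'S5']
```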
Q2 (data plane): How to route queries according to the chain structure?
NetChain routing: segment routing according to the chain structure (write path). The client H0 sends the write to the head with dstIP = S0, SC = 2, and remaining chain [S1, S2]. Each switch applies the write, pops the next chain entry into dstIP, and decrements SC: S0 forwards with dstIP = S1, SC = 1, chain [S2]; S1 forwards with dstIP = S2, SC = 0. The tail S2 sends the write reply back to the client with dstIP = H0.
Read path: the client H0 sends the read directly to the tail (dstIP = S2, SC = 2, chain [S1, S0]); the tail serves the read and replies straight to the client (dstIP = H0). A sketch of the per-hop forwarding step follows below.
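```python
# A sketch of the per-hop routing step for a write, mirroring the header
# fields on the previous slides: each chain switch applies the write, pops
# the next chain entry into dstIP, and decrements SC; when SC has reached 0
# the switch is the tail and replies to the client. The packet is modeled
# as a dict and all field names are illustrative.
def process_write_hop(pkt, local_store, client_ip):
    local_store[pkt["key"]] = pkt["value"]     # serve the write at this hop
    if pkt["sc"] > 0:
        pkt["dst_ip"] = pkt["chain"].pop(0)    # next hop = next chain entry
        pkt["sc"] -= 1
    else:
        pkt["dst_ip"] = client_ip              # tail: send the reply back
        pkt["op"] = "write_reply"
    return pkt

pkt = {"op": "write", "key": "foo", "value": "B",
       "dst_ip": "S0", "sc": 2, "chain": ["S1", "S2"]}
store = {}
pkt = process_write_hop(pkt, store, "H0")      # at S0: dstIP -> S1, SC -> 1
assert pkt["dst_ip"] == "S1" and pkt["sc"] == 1
```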
Q3 (data plane): How to handle out-of-order delivery in the network?
Problem of out-of-order delivery: two concurrent writes W1: foo=B and W2: foo=C traverse the chain S0 (head), S1 (replica), S2 (tail). If the two writes are reordered between hops, the replicas apply them in different orders and end up with inconsistent values for foo (some ending with B, others with C). The fix: serialization with sequence numbers, sketched below.
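```python
# A minimal sketch of that serialization, assuming (as in chain replication)
# that only the head assigns monotonically increasing sequence numbers and
# each replica keeps the last applied sequence number per key: a write is
# applied only if its sequence number is newer, so reordered or duplicated
# packets cannot leave replicas inconsistent. Names are illustrative.
class Replica:
    def __init__(self):
        self.store = {}                        # key -> (seq, value)

    def apply_write(self, key, seq, value):
        cur_seq = self.store.get(key, (0, None))[0]
        if seq > cur_seq:                      # stale/reordered writes are dropped
            self.store[key] = (seq, value)

s1, s2 = Replica(), Replica()
# W1 (seq=1, foo=B) and W2 (seq=2, foo=C) arrive in different orders:
s1.apply_write("foo", 1, "B"); s1.apply_write("foo", 2, "C")
s2.apply_write("foo", 2, "C"); s2.apply_write("foo", 1, "B")   # dropped as stale
assert s1.store == s2.store == {"foo": (2, "C")}
```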
Q4 (control plane): How to handle switch failures?
Handling a switch failure. Before the failure, a chain such as [S0, S1, S2] tolerates f failures with f+1 nodes. Two steps, both sketched below:
- Fast failover: fail over to the remaining f nodes, e.g., [S0, S2], which then tolerate f-1 failures. Efficiency: only the neighbor switches of the failed switch need to be updated.
- Failure recovery: add another switch, e.g., [S0, S3, S2], to tolerate f failures again. Consistency: two-phase atomic switching. Disruption is minimized with virtual groups.
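```python
# A toy sketch of the two chain transformations on [S0, S1, S2] when S1
# fails. Fast failover only re-points the failed switch's chain neighbors;
# failure recovery later inserts a replacement so the chain tolerates f
# failures again. The real controller coordinates the recovery step with
# two-phase atomic switching, which this illustration omits.
def fast_failover(chain, failed):
    return [sw for sw in chain if sw != failed]        # e.g. [S0, S2]

def failure_recovery(chain, replacement, position):
    return chain[:position] + [replacement] + chain[position:]

chain = fast_failover(["S0", "S1", "S2"], "S1")        # tolerates f-1 failures
chain = failure_recovery(chain, "S3", 1)               # back to f+1 nodes
assert chain == ["S0", "S3", "S2"]
```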
Protocol correctness.
Invariant: for any key k assigned to a chain of nodes [S_1, S_2, ..., S_n], if 1 ≤ i < j ≤ n (i.e., S_i is a predecessor of S_j), then State_{S_i}[k].seq ≥ State_{S_j}[k].seq.
- Guarantees strong consistency under packet loss, packet reordering, and switch failures.
- See the paper for the TLA+ specification.
Implementation.
- Testbed: 4 Barefoot Tofino switches and 4 commodity servers.
- Switch: P4 program on a 6.5 Tbps Barefoot Tofino; basic L2/L3 routing; key-value store with up to 100K items and up to 128-byte values.
- Server: 16-core Intel Xeon E5-2630, 128 GB memory, 25/40 Gbps Intel NICs; Intel DPDK to generate query traffic, up to 20.5 MQPS per server.
Evaluation.
- Can NetChain provide significant performance improvements?
- Can NetChain scale out to a large number of switches?
- Can NetChain efficiently handle failures?
- Can NetChain benefit applications?
Orders of magnitude higher throughput. [Figure: throughput vs. value size (0-128 bytes) and vs. store size (0-100K items). NetChain(max) sustains ~2,000 MQPS and NetChain(4) ~82 MQPS across both ranges, versus ~0.15 MQPS for ZooKeeper.]
Orders of magnitude lower latency. [Figure: latency vs. throughput. NetChain reads and writes take ~9.7 µs, versus ~170 µs for ZooKeeper reads and ~2,350 µs for ZooKeeper writes.]