NetChain: Scale-Free Sub-RTT Coordination
Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica
Conventional wisdom: avoid coordination. NetChain: lightning-fast coordination enabled by programmable switches, opening the door to rethinking distributed systems design.
Coordination services: a fundamental building block of the cloud. Applications are built on top of a coordination service (e.g., Chubby).
Coordination services provide critical coordination functionality to applications: configuration management, group membership, distributed locking, and barriers.
The core of a coordination service is a strongly-consistent, fault-tolerant key-value store running on servers. That store is the focus of this talk.
Workflow of coordination services: a client sends a request to the coordination servers, which run a consensus protocol and send back a reply. Can we do better?
- Throughput: at most the server NIC throughput.
- Latency: at least one RTT, typically a few RTTs.
Opportunity: in-network coordination. Distributed coordination is communication-heavy, not computation-heavy, which makes it a natural fit for switches.

                      Switch (Barefoot Tofino)   Server ([NetBricks, OSDI'16])
  Packets per second  a few billion              30 million
  Bandwidth           6.5 Tbps                   10-100 Gbps
  Processing delay    < 1 µs                     10-100 µs
Opportunity: in-network coordination. The client sends its request to coordination switches running the consensus protocol and gets the reply directly from the data plane.
- Throughput: switch throughput.
- Latency: half of an RTT.
Design goals for coordination services:
- High throughput and low latency: directly from high-performance switches.
- Strong consistency and fault tolerance: how? Chain replication in the network.
What is chain replication?
- Storage nodes are organized in a chain structure: head (S0), replica (S1), tail (S2).
- Handling operations: reads are served by the tail; writes propagate from head to tail.
- Provides strong consistency and fault tolerance: tolerates f failures with f+1 nodes.
A minimal sketch of the two paths follows below.
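```python
# A minimal Python sketch of chain replication's read/write paths: reads are
# served by the tail, writes enter at the head and propagate hop by hop to
# the tail, which acknowledges. The Node class and names are illustrative,
# not NetChain's actual API.
class Node:
    def __init__(self):
        self.store = {}
        self.next = None              # successor in the chain (None at the tail)

    def write(self, key, value):
        self.store[key] = value       # apply locally, then forward downstream
        if self.next is not None:
            return self.next.write(key, value)
        return "ack"                  # the tail acknowledges the write

    def read(self, key):
        return self.store.get(key)    # correct only when called on the tail

head, replica, tail = Node(), Node(), Node()
head.next, replica.next = replica, tail
assert head.write("foo", "A") == "ack"
assert tail.read("foo") == "A"
```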
Division of labor in chain replication: a perfect match to the network architecture.
Chain replication:
- Storage nodes: optimized for high performance; handle read & write requests and provide strong consistency.
- Auxiliary master: handles less frequent reconfiguration and provides fault tolerance.
Network architecture:
- Network data plane: handles packets at line rate.
- Network control plane: handles network reconfiguration.
NetChain overview: the switches (S0-S5) handle read & write requests at line rate, while a network controller handles reconfigurations (e.g., switch failures); clients run on hosts in the racks.
How to build a strongly-consistent, fault-tolerant, in-network key-value store?
Data plane:
- Q1: How to store and serve key-value items?
- Q2: How to route queries according to the chain structure?
- Q3: How to handle out-of-order delivery in the network?
Control plane:
- Q4: How to handle switch failures?
PISA: Protocol Independent Switch Architecture.
- Programmable parser: converts packet data into metadata.
- Programmable match-action pipeline: operates on metadata and updates memory state (each stage has match units, ALUs, and memory).
PISA for NetChain:
- Programmable parser: parses the custom key-value fields in the packet.
- Programmable match-action pipeline: reads and updates key-value data at line rate. A pure-Python illustration of one stage follows below.
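```python
# A pure-Python illustration (not P4) of what a single match-action stage
# does for NetChain: match on the parsed key, then read or update a slot in
# the stage's register array -- the stateful memory a PISA ALU operates on
# at line rate. Table contents and array size are made up for the example.
register_array = [None] * 8                    # per-stage stateful memory
match_table = {"X": 0, "Y": 5, "Z": 2}         # match: key -> action: RA index

def match_action_stage(op, key, value=None):
    idx = match_table.get(key)
    if idx is None:
        return None                            # default action: Drop()
    if op == "write":
        register_array[idx] = value            # ALU updates memory state
    return register_array[idx]                 # value carried back in the packet

match_action_stage("write", "X", "A")
assert match_action_stage("read", "X") == "A"
```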
Inside a NetChain switch: the control plane (CPU) runs a NetChain agent that talks to the network controller and manages the data plane over a PCIe run-time API; the data plane (ASIC) runs the regular network functions and the key-value store on the programmable parser and match-action pipeline.
Q1 (data plane): How to store and serve key-value items?
NetChain packet format:

  ETH | IP | UDP | SC | S0 | S1 | ... | Sk | OP | SEQ | KEY | VALUE

- ETH/IP/UDP: existing protocols, normal L2/L3 routing; NetChain is invoked via a reserved UDP port.
- SC, S0...Sk: NetChain routing state, the remaining chain of switch IPs.
- OP: read, write, delete, etc.
- SEQ: sequence number, inserted by the head switch.
- An application-layer protocol: compatible with existing L2-L4 layers. A sketch of packing/parsing this payload follows below.
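```python
import socket
import struct

# A hedged sketch of building and parsing the NetChain payload that follows
# the UDP header. The field order (SC, S0..Sk, OP, SEQ, KEY, VALUE) comes
# from the slide; the field widths chosen here (8-bit SC and OP, 32-bit SEQ,
# 64-bit KEY) are assumptions for illustration only.
OPS = {"read": 0, "write": 1, "delete": 2}

def pack_netchain(chain_ips, op, seq, key, value=b""):
    buf = struct.pack("!B", len(chain_ips))          # SC: remaining chain hops
    for ip in chain_ips:                             # S0..Sk: one IPv4 per hop
        buf += socket.inet_aton(ip)
    buf += struct.pack("!BIQ", OPS[op], seq, key)    # OP, SEQ, KEY
    return buf + value                               # VALUE: <= 128 B in the prototype

def unpack_netchain(buf):
    sc = buf[0]
    chain = [socket.inet_ntoa(buf[1 + 4*i : 5 + 4*i]) for i in range(sc)]
    off = 1 + 4 * sc
    op, seq, key = struct.unpack_from("!BIQ", buf, off)
    return chain, op, seq, key, buf[off + 13:]

pkt = pack_netchain(["10.0.0.1", "10.0.0.2"], "write", seq=7, key=42, value=b"A")
assert unpack_netchain(pkt) == (["10.0.0.1", "10.0.0.2"], 1, 7, 42, b"A")
```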
In-network key-value storage.
- Key-value store in a single switch: store and serve key-value items using register arrays [NetCache, SOSP'17]. A match-action table maps each key to a register-array slot:

    Match      Action
    Key = X    Read/Write RA[0]
    Key = Y    Read/Write RA[5]
    Key = Z    Read/Write RA[2]
    Default    Drop()

- Key-value store across the network: data partitioning with consistent hashing and virtual nodes (a sketch follows below).
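```python
import hashlib
from bisect import bisect

# A minimal sketch of the partitioning scheme: consistent hashing with
# virtual nodes. Each switch is mapped to many points on a hash ring, and a
# key's chain is the next f+1 distinct switches clockwise from the key's
# hash. The hash function, virtual-node count, and chain length here are
# illustrative choices, not NetChain's exact parameters.
def _h(s):
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, switches, vnodes=100, chain_len=3):
        self.chain_len = chain_len
        self.points = sorted((_h(f"{sw}#{v}"), sw)
                             for sw in switches for v in range(vnodes))

    def chain(self, key):
        """Return the chain [head, ..., tail] of distinct switches for a key."""
        idx = bisect(self.points, (_h(key), ""))
        chain, seen = [], set()
        for i in range(len(self.points)):
            sw = self.points[(idx + i) % len(self.points)][1]
            if sw not in seen:
                seen.add(sw)
                chain.append(sw)
                if len(chain) == self.chain_len:
                    break
        return chain

ring = Ring(["S0", "S1", "S2", "S3", "S4", "S5"])
print(ring.chain("foo"))   # e.g. ['S3', 'S0', 'S5']
```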
Q2 (data plane): How to route queries according to the chain structure?
NetChain routing: segment routing according to the chain structure (write path). The client H0 sends the write to the head with dstIP = S0, SC = 2, and remaining chain [S1, S2]. Each switch applies the write, pops the next chain entry into dstIP, and decrements SC: S0 forwards with dstIP = S1, SC = 1, chain [S2]; S1 forwards with dstIP = S2, SC = 0. The tail S2 sends the write reply back to the client with dstIP = H0.
Read path: the client H0 sends the read directly to the tail (dstIP = S2, SC = 2, chain [S1, S0]); the tail serves the read and replies straight to the client (dstIP = H0). A sketch of the per-hop forwarding step follows below.
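```python
# A sketch of the per-hop routing step for a write, mirroring the header
# fields on the previous slides: each chain switch applies the write, pops
# the next chain entry into dstIP, and decrements SC; when SC has reached 0
# the switch is the tail and replies to the client. The packet is modeled
# as a dict and all field names are illustrative.
def process_write_hop(pkt, local_store, client_ip):
    local_store[pkt["key"]] = pkt["value"]     # serve the write at this hop
    if pkt["sc"] > 0:
        pkt["dst_ip"] = pkt["chain"].pop(0)    # next hop = next chain entry
        pkt["sc"] -= 1
    else:
        pkt["dst_ip"] = client_ip              # tail: send the reply back
        pkt["op"] = "write_reply"
    return pkt

pkt = {"op": "write", "key": "foo", "value": "B",
       "dst_ip": "S0", "sc": 2, "chain": ["S1", "S2"]}
store = {}
pkt = process_write_hop(pkt, store, "H0")      # at S0: dstIP -> S1, SC -> 1
assert pkt["dst_ip"] == "S1" and pkt["sc"] == 1
```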
Q3 (data plane): How to handle out-of-order delivery in the network?
Problem of out-of-order delivery: two concurrent writes W1: foo=B and W2: foo=C traverse the chain S0 (head), S1 (replica), S2 (tail). If the two writes are reordered between hops, the replicas apply them in different orders and end up with inconsistent values for foo (some ending with B, others with C). The fix: serialization with sequence numbers, sketched below.
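```python
# A minimal sketch of that serialization, assuming (as in chain replication)
# that only the head assigns monotonically increasing sequence numbers and
# each replica keeps the last applied sequence number per key: a write is
# applied only if its sequence number is newer, so reordered or duplicated
# packets cannot leave replicas inconsistent. Names are illustrative.
class Replica:
    def __init__(self):
        self.store = {}                        # key -> (seq, value)

    def apply_write(self, key, seq, value):
        cur_seq = self.store.get(key, (0, None))[0]
        if seq > cur_seq:                      # stale/reordered writes are dropped
            self.store[key] = (seq, value)

s1, s2 = Replica(), Replica()
# W1 (seq=1, foo=B) and W2 (seq=2, foo=C) arrive in different orders:
s1.apply_write("foo", 1, "B"); s1.apply_write("foo", 2, "C")
s2.apply_write("foo", 2, "C"); s2.apply_write("foo", 1, "B")   # dropped as stale
assert s1.store == s2.store == {"foo": (2, "C")}
```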
Q4 (control plane): How to handle switch failures?
Handling a switch failure. Before the failure, a chain such as [S0, S1, S2] tolerates f failures with f+1 nodes. Two steps, both sketched below:
- Fast failover: fail over to the remaining f nodes, e.g., [S0, S2], which then tolerate f-1 failures. Efficiency: only the neighbor switches of the failed switch need to be updated.
- Failure recovery: add another switch, e.g., [S0, S3, S2], to tolerate f failures again. Consistency: two-phase atomic switching. Disruption is minimized with virtual groups.
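```python
# A toy sketch of the two chain transformations on [S0, S1, S2] when S1
# fails. Fast failover only re-points the failed switch's chain neighbors;
# failure recovery later inserts a replacement so the chain tolerates f
# failures again. The real controller coordinates the recovery step with
# two-phase atomic switching, which this illustration omits.
def fast_failover(chain, failed):
    return [sw for sw in chain if sw != failed]        # e.g. [S0, S2]

def failure_recovery(chain, replacement, position):
    return chain[:position] + [replacement] + chain[position:]

chain = fast_failover(["S0", "S1", "S2"], "S1")        # tolerates f-1 failures
chain = failure_recovery(chain, "S3", 1)               # back to f+1 nodes
assert chain == ["S0", "S3", "S2"]
```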
Protocol correctness.
Invariant: for any key k assigned to a chain of nodes [S_1, S_2, ..., S_n], if 1 ≤ i < j ≤ n (i.e., S_i is a predecessor of S_j), then State_{S_i}[k].seq ≥ State_{S_j}[k].seq.
- Guarantees strong consistency under packet loss, packet reordering, and switch failures.
- See the paper for the TLA+ specification.
Implementation.
- Testbed: 4 Barefoot Tofino switches and 4 commodity servers.
- Switch: P4 program on a 6.5 Tbps Barefoot Tofino; basic L2/L3 routing; key-value store with up to 100K items and up to 128-byte values.
- Server: 16-core Intel Xeon E5-2630, 128 GB memory, 25/40 Gbps Intel NICs; Intel DPDK to generate query traffic, up to 20.5 MQPS per server.
Evaluation.
- Can NetChain provide significant performance improvements?
- Can NetChain scale out to a large number of switches?
- Can NetChain efficiently handle failures?
- Can NetChain benefit applications?
Orders of magnitude higher throughput. [Figure: throughput vs. value size (0-128 bytes) and vs. store size (0-100K items). NetChain(max) sustains ~2,000 MQPS and NetChain(4) ~82 MQPS across both ranges, versus ~0.15 MQPS for ZooKeeper.]
Orders of magnitude lower latency. [Figure: latency vs. throughput. NetChain reads and writes take ~9.7 µs, versus ~170 µs for ZooKeeper reads and ~2,350 µs for ZooKeeper writes.]