Reaching reliable agreement in an unreliable world


  1. Reaching reliable agreement in an unreliable world Heidi Howard heidi.howard@cl.cam.ac.uk twitter: @heidiann blog: hh360.user.srcf.net Cambridge Tech Talks 17th November 2015 slides: hh360.user.srcf.net/slides/cam_tech_talks.pdf

  2. Distributed Systems in Practice • Social networks • Banking • Government information systems • E-commerce • Web servers

  3. Distributed Systems in Theory “… a collection of distinct processes which are spatially separated and which communicate with one another by exchanging messages … the message delay is not negligible compared to the time between events in a single process” [CACM ‘78] Leslie Lamport

  4. Introducing Alice Alice is a new graduate, new to the world of work. She joins a cool new start-up, where she is responsible for a distributed system.

  5. Key Value Store [diagram: a store holding A=7, B=2, C=1]

  6. Key Value Store [diagram: a client reads A and the store returns 7]

  7. Key Value Store [diagram: while the read is being served, a second client sends the write B=5]

  8. Key Value Store [diagram: the store applies B=5 and replies OK]

  9. Key Value Store [diagram: a subsequent read of B returns 5]
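To make the store on these slides concrete, here is a minimal illustrative sketch (not code from the talk) of the key-value interface: a read returns the current value and a write replaces it, acknowledging with OK.

```python
# Toy in-memory key-value store matching the interface on slides 5-9:
# a read ("A?") returns the stored value, a write ("B=5") replaces it.
class KeyValueStore:
    def __init__(self, initial=None):
        self.data = dict(initial or {})

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value
        return "OK"

store = KeyValueStore({"A": 7, "B": 2, "C": 1})
assert store.read("A") == 7           # slide 6
assert store.write("B", 5) == "OK"    # slides 7-8
assert store.read("B") == 5           # slide 9
```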

  10. Requirements • Scalability - High throughput processing of operations. • Latency - Low latency commit of operations as perceived by the client. • Fault-tolerance - Availability in the face of machine and network failures. • Linearizable semantics - Operate as if a single server system.

  11. Single Server System [diagram: one server holding A=7, B=2, serving Clients 1-3]

  12. Single Server System [diagram: Client 1 reads A and the server returns 7]

  13. Single Server System [diagram: a client writes B=3 and the server replies OK]

  14. Single Server System [diagram: a client reads B and the server returns 3]

  15. Single Server System Pros: • easy to deploy • low latency (1 RTT in common case) • requests executed in-order Cons: • system unavailable if server or network fails • throughput limited to one server

  16. Single Server System (v.2) Pros: • easy to deploy • low latency (1 RTT in common case) • linearizable semantics • durability with write-ahead logging • partition tolerance with retransmission & command cache Cons: • system unavailable if server fails • throughput limited to one server
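The v.2 bullets mention durability via write-ahead logging and a command cache for retransmitted requests. A rough sketch of how those two pieces could fit together follows; the log format and command-ID scheme are assumptions for illustration, not details from the talk.

```python
import json
import os

class DurableServer:
    """Single-server v.2 sketch: append each command to a write-ahead log
    before applying it, and cache responses by command ID so a retransmitted
    request is answered without being executed twice. Illustrative only;
    the log format and command-ID scheme are assumptions."""

    def __init__(self, log_path="server.wal"):
        self.log_path = log_path
        self.data = {}
        self.responses = {}            # command ID -> cached response
        self._replay()

    def _replay(self):
        # Recover state after a crash by re-applying the durable log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                self._apply(json.loads(line))

    def _apply(self, cmd):
        self.data[cmd["key"]] = cmd["value"]
        self.responses[cmd["id"]] = "OK"
        return "OK"

    def write(self, cmd_id, key, value):
        # A retransmitted command is answered from the cache, not re-executed.
        if cmd_id in self.responses:
            return self.responses[cmd_id]
        cmd = {"id": cmd_id, "key": key, "value": value}
        # Write-ahead: make the command durable before applying it.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(cmd) + "\n")
            f.flush()
            os.fsync(f.fileno())
        return self._apply(cmd)
```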

  17. Backups [diagram: a primary holding A=7, B=2 with three backups; Client 1 reads A and receives 7] aka Primary backup replication

  18. Backups [diagram: Client 2 sends the write B=1 to the primary] aka Primary backup replication

  19. Backups [diagram: the primary applies B=1 and forwards it to the backups] aka Primary backup replication

  20. Backups [diagram: the backups acknowledge, every replica now holds B=1, and OK is returned] aka Primary backup replication
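A sketch of the primary-backup flow these slides step through: the primary applies a write, pushes it to every backup, and only then acknowledges the client. The in-process objects below stand in for real servers and a real network, purely for illustration, and they quietly assume the reliable, ordered delivery that the next slide calls out as the big gotcha.

```python
class Replica:
    """A backup: holds a copy of the data and applies forwarded writes."""
    def __init__(self, initial):
        self.data = dict(initial)

    def apply(self, key, value):
        self.data[key] = value
        return "OK"

class Primary(Replica):
    """The primary applies a write, replicates it to every backup,
    and only then acknowledges the client."""
    def __init__(self, initial, backups):
        super().__init__(initial)
        self.backups = backups

    def write(self, key, value):
        self.apply(key, value)
        for backup in self.backups:
            backup.apply(key, value)   # assumes reliable, ordered delivery
        return "OK"

    def read(self, key):
        return self.data.get(key)

backups = [Replica({"A": 7, "B": 2}) for _ in range(3)]
primary = Primary({"A": 7, "B": 2}, backups)
assert primary.read("A") == 7                      # slide 17
assert primary.write("B", 1) == "OK"               # slides 18-20
assert all(b.data["B"] == 1 for b in backups)
```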

  21. Big Gotcha We are assuming totally ordered broadcast

  22. Totally Ordered Broadcast (aka atomic broadcast) The guarantee that messages are received reliably and in the same order by all nodes.

  23. Intro (Review) So far we have: • Defined our notion of a distributed system • Introduced an example distributed system (Alice and her key-value store) • Seen that straw man approaches to building this system are not sufficient Any questions so far?

  24. Doing the Impossible

  25. CAP Theorem Pick 2 of 3: • Consistency • Availability • Partition tolerance Proposed by Eric Brewer in 1998, still debated and regarded as misleading. [Brewer’12] [Kleppmann’15]

  26. FLP Impossibility It is impossible to guarantee consensus when messages may be delayed if even one node may fail. [JACM’85]

  27. Consensus is impossible [PODC’89] Nancy Lynch

  28. Aside from Simon PJ Don’t drag your reader or listener through your blood-stained path. Simon Peyton Jones

  29. Paxos Paxos is at the foundation of (almost) all distributed consensus protocols. It is a general approach using two phases and majority quorums. It takes much more to construct a complete fault-tolerant distributed system.
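The talk deliberately does not walk through Paxos itself, but a compressed sketch of single-decree Paxos may make "two phases and majority quorums" concrete: phase 1 (prepare/promise) reserves a ballot at a majority of acceptors, and phase 2 (accept) gets a value chosen. This is a simplified, single-process illustration, not a complete or fault-tolerant implementation.

```python
class Acceptor:
    """Single-decree Paxos acceptor (durability of its state is omitted)."""
    def __init__(self):
        self.promised = -1       # highest ballot promised so far
        self.accepted = None     # (ballot, value) of the last accepted proposal

    def prepare(self, ballot):
        # Phase 1: promise to ignore anything below this ballot.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot, value):
        # Phase 2: accept unless a higher ballot has been promised since.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    """Try to get `value` chosen; a previously accepted value may win instead."""
    majority = len(acceptors) // 2 + 1
    # Phase 1: gather promises from a majority.
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [accepted for kind, accepted in replies if kind == "promise"]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, adopt the one with the
    # highest ballot; this is what makes Paxos safe.
    prior = [accepted for accepted in granted if accepted is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    # Phase 2: ask a majority to accept the value.
    acks = sum(1 for a in acceptors if a.accept(ballot, value) == "accepted")
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
assert propose(acceptors, ballot=1, value="B=5") == "B=5"
```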

  30. Consensus is hard

  31. Doing the Impossible (Review) In this section, we have: • Learned about various impossibility results in the field, such as the CAP theorem and the FLP result • Introduced the fundamental (yet famously difficult to understand) Paxos algorithm Any questions so far?

  32. A raft in the sea of confusion

  33. Case Study 1: Raft Raft, the understandable replication algorithm. Provides us with linearisable semantics and, in the best case, 2 RTT latency. A complete(ish) architecture for making our application fault-tolerant.

  34. State Machine Replication [diagram: three servers, each holding A=7, B=2; a client prepares the command B=3]

  35. State Machine Replication [diagram: the client submits B=3 to one of the servers]

  36. State Machine Replication [diagram: the command B=3 is replicated into every server's log]

  37. State Machine Replication [diagram: every server's log now contains B=3 and the command can be applied]
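A sketch of the state-machine-replication idea in slides 34-37: if every server holds the same log of commands and applies it in the same order, every server computes the same state. How the servers agree on that log is exactly the problem Raft solves; the code below simply assumes the log has already been replicated.

```python
class ReplicaStateMachine:
    """Apply the same log of commands in the same order and every replica
    ends up in the same state."""
    def __init__(self):
        self.log = []                  # replicated log of (key, value) commands
        self.data = {"A": 7, "B": 2}

    def append(self, command):
        self.log.append(command)

    def apply_all(self):
        for key, value in self.log:
            self.data[key] = value
        return self.data

replicas = [ReplicaStateMachine() for _ in range(3)]
for r in replicas:                     # the command reaches every server's log
    r.append(("B", 3))
assert all(r.apply_all() == {"A": 7, "B": 3} for r in replicas)
```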

  38. Leadership [state diagram: a node starts (or restarts) as Follower; on a timeout it becomes Candidate; a Candidate that times out restarts the election; a Candidate that wins becomes Leader; Candidates and Leaders step down to Follower]

  39. Ordering Each node stores its own perspective on a value known as the term. Each message includes the sender’s term and this is checked by the recipient. The term orders periods of leadership to aid in avoiding conflict. Each node has one vote per term, so there is at most one leader per term.

  40. [diagram: five nodes, IDs 1-5, all in term 0 with no vote cast]

  41. Leadership [state diagram repeated: Follower becomes Candidate on timeout, Candidate becomes Leader on winning the election, and both step down to Follower]

  42. [diagram: node 4 times out, moves to term 1, votes for itself and asks the others to vote for it in term 1]

  43. [diagram: nodes 1, 2, 3 and 5 grant their term-1 vote to node 4, which wins the election]
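A sketch of the one-vote-per-term rule described on the "Ordering" slide and played out in slides 40-43: a node grants its vote only if the candidate's term is at least as new as its own and it has not yet voted in that term. Raft's real vote check also compares log up-to-dateness, which is omitted here.

```python
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.term = 0
        self.voted_for = None          # at most one vote per term

    def request_vote(self, candidate_id, candidate_term):
        # Seeing a higher term moves us into that term with a fresh vote.
        if candidate_term > self.term:
            self.term = candidate_term
            self.voted_for = None
        # Grant the vote only if we have not voted for someone else this term.
        if candidate_term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

nodes = [Node(i) for i in range(1, 6)]          # slides 40-43: five nodes
candidate = nodes[3]                             # node 4 stands for election
candidate.term, candidate.voted_for = 1, candidate.id
votes = 1 + sum(n.request_vote(candidate.id, candidate.term)
                for n in nodes if n is not candidate)
assert votes >= len(nodes) // 2 + 1              # majority reached: node 4 wins
```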

  44. Replication Each node has a log of client commands and an index into this log representing which commands have been committed. A command is considered committed when the leader has replicated it into the logs of a majority of servers.
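A sketch of the commit rule on this slide: an entry counts as committed once it is in the logs of a majority of servers. The leader can compute this from how far each follower's log is known to match its own; the `match_index` bookkeeping below follows Raft's description, but treat the exact form as an assumption for illustration.

```python
def committed_index(match_index, leader_log_len):
    """Highest log index present on a majority of servers.
    `match_index` holds, for each follower, the last index known to match
    the leader's log; the leader itself counts as fully up to date."""
    indexes = sorted(match_index + [leader_log_len], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]

# Five servers: the leader has 7 entries; followers have matched 7, 5, 5 and 3.
assert committed_index([7, 5, 5, 3], 7) == 5    # index 5 is on 3 of 5 servers
```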

  45. Evaluation • The leader is a serious bottleneck -> limited scalability • Can only handle the failure of a minority of nodes • Some rare network partitions can leave the protocol in livelock

  46. Raft in the sea of confusion (Review) In this section, we have: • Introduced the Raft algorithm • Seen how Raft elects a leader among a collection of nodes • Evaluated the Raft algorithm Any questions so far?

  47. Beyond Raft

  48. Case Study 2: Tango Tango is designed to be a scalable replication protocol. It is a variant of chain replication. It is leaderless and pushes more work onto clients.

  49. Simple Replication [diagram: three servers each hold the log entry 0: A=4; a sequencer holds Next: 1; Client 2 wants to append B=5]

  50. Simple Replication [diagram: Client 2 asks the sequencer "Next?", receives position 1, and the sequencer advances to Next: 2]

  51. Simple Replication [diagram: Client 2 writes B=5 at position 1 to Server 1, which acknowledges OK]

  52. Simple Replication [diagram: Client 2 writes B=5 at position 1 to Server 2, which acknowledges OK]

  53. Simple Replication [diagram: Client 2 writes B=5 at position 1 to Server 3, which acknowledges OK]

  54. Simple Replication [diagram: all three logs now contain 0: A=4 and 1: B=5, and Client 2's local view shows B=5]
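A sketch of the "simple replication" scheme in slides 49-54: a sequencer hands out log positions, and the client itself writes the entry at that position to every server, so there is no leader in the data path. The names and interfaces below are an illustration of the idea, not Tango's actual API.

```python
class Sequencer:
    """Hands out the next free position in the shared log."""
    def __init__(self):
        self.next_free = 0

    def next_position(self):
        pos = self.next_free
        self.next_free += 1
        return pos

class LogServer:
    """Stores log entries by position; every server holds a full copy."""
    def __init__(self):
        self.log = {}

    def write_at(self, position, command):
        self.log[position] = command
        return "OK"

def client_append(sequencer, servers, command):
    # The client does the replication work itself: reserve a position,
    # then write the entry at that position to every server.
    pos = sequencer.next_position()
    for server in servers:
        server.write_at(pos, command)
    return pos

sequencer = Sequencer()
servers = [LogServer() for _ in range(3)]
client_append(sequencer, servers, ("A", 4))      # position 0, as on slide 49
client_append(sequencer, servers, ("B", 5))      # position 1, slides 50-54
assert all(s.log[1] == ("B", 5) for s in servers)
```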

  55. Beyond Raft (Review) In this section, we have: • Introduced an alternative algorithm, known as Tango • Seen that Tango is scalable, as the leader is no longer the bottleneck, but it comes at the cost of higher latency Any questions so far?

  56. Next Steps

  57. wait… we’re not finished yet!

  58. Requirements • Scalability - High throughput processing of operations. • Latency - Low latency commit of operations as perceived by the client. • Fault-tolerance - Availability in the face of machine and network failures. • Linearizable semantics - Operate as if a single server system.

  59. Many more examples • Raft [ATC’14] - Good starting point, understandable algorithm from SMR + multi-Paxos variant • Tango [SOSP’13] - Scalable algorithm for f+1 nodes, uses CR + multi-Paxos variant • VRR [MIT-TR’12] - Raft with round-robin leadership & more distributed load • ZooKeeper [ATC'10] - Primary backup replication + atomic broadcast protocol (Zab [DSN’11]) • EPaxos [SOSP’13] - leaderless Paxos variant for WANs
