The ABCDs of Paxos
Consensus: a set of processes decide on an input value.
Main application: replicated state machines.
Paxos: an asynchronous consensus algorithm
  AP  Abstract Paxos: generic, non-local version
  CP  Classic Paxos: stopping failures, compare-and-swap (1989: Lamport, Liskov and Oki)
  DP  Disk Paxos: stopping failures, read-write (1999: Gafni and Lamport)
  BP  Byzantine Paxos: arbitrary failures (1999: Castro and Liskov)
The paper and slides are at research.microsoft.com/lampson
Butler Lampson, ABCDs of Paxos: PODC 2001

Replicated State Machines
Lamport 1978: Time, clocks and the ordering of events …
Cast your problem as a deterministic state machine that
  • takes client input requests for state transitions, called steps
  • performs the steps
  • returns the output to the client
Make n copies or ‘replicas’ of the state machine.
Use consensus to feed all the replicas the same inputs.
Steps must be deterministic, local to a replica, and atomic (use transactions).
Recover by replaying the steps (like transactions).
Even a read needs a step, unless the result is “as of step n”.
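
The replicated state machine idea can be sketched in a few lines. This is a minimal illustration, not code from the paper: the key-value `Replica`, the request tuples, and `replay` are hypothetical names, and the agreed `log` is simply written down here, whereas a real system would use consensus to choose each entry.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a deterministic state machine (here a tiny key-value store)."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of steps performed so far

    def step(self, request):
        """Perform one step atomically and return the output for the client."""
        op, key, *rest = request
        if op == "put":
            self.state[key] = rest[0]
            output = "ok"
        elif op == "get":  # even a read is a step in the agreed sequence
            output = self.state.get(key)
        else:
            raise ValueError(f"unknown op {op!r}")
        self.applied += 1
        return output


def replay(log):
    """Recover a replica by replaying the agreed steps from the beginning."""
    replica = Replica()
    for request in log:
        replica.step(request)
    return replica


# The agreed sequence of inputs (assumed given; consensus would choose each entry).
log = [("put", "x", 1), ("put", "y", 2), ("get", "x"), ("put", "x", 3)]

replicas = [Replica() for _ in range(3)]
outputs = [[r.step(req) for req in log] for r in replicas]
recovered = replay(log)
```

Because every step is deterministic and every replica sees the same inputs in the same order, all three replicas produce the same outputs and end in the same state, and a recovered replica catches up by replay alone.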

Applications of RSM
• Reliable, available data storage system
• Airplane flight control
• Reflexive applications:
    Changing quorums of the consensus algorithm
    Issuing a lease: a lock on part of the state that times out, hence is fault tolerant
      The leaseholder can work on its state without consensus.
      Like any lock, a lease can have modes or be hierarchical.
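
A lease can be modeled as a small piece of replicated state. The sketch below is an assumption-laden illustration: `Lease`, `LeaseTable`, and `grant` are hypothetical names, and the current time is passed in as part of the step input so that the step stays deterministic, as the slides require of every step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    holder: str
    resource: str
    expires_at: float

    def valid(self, now: float) -> bool:
        """A lease is a lock that times out."""
        return now < self.expires_at


class LeaseTable:
    """Part of the replicated state; each grant() would be one agreed step."""

    def __init__(self):
        self.leases = {}

    def grant(self, holder, resource, now, duration):
        """Grant or renew a lease unless another holder's lease is still valid."""
        current = self.leases.get(resource)
        if current is not None and current.valid(now) and current.holder != holder:
            return None
        lease = Lease(holder, resource, now + duration)
        self.leases[resource] = lease
        return lease


table = LeaseTable()
first = table.grant("p1", "shard-7", now=0.0, duration=10.0)
refused = table.grant("p2", "shard-7", now=5.0, duration=10.0)
later = table.grant("p2", "shard-7", now=10.0, duration=10.0)
```

While `first` is valid, `p1` can work on `shard-7` without running consensus; if `p1` fails, the lease simply expires and `p2` can take over.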

The Idea of Paxos
A sequence of views; get a decision quorum in one of them.
Each view v chooses an anchored value c_v, equal to any earlier decision.
If a quorum accepts the choice, decision!
Decision is irrevocable, may be invisible, but is any later view's choice.
Choice is changeable, must be visible if there was a decision.
[Figure: processes and agents a over time, split into "view change" and "normal operation". Actions shown: Start, Input, Close^a, Anchor, Choose, Accept^a, Finish^a, STEP. Labels: INPUT, OUTPUT, Transmit, c_v, r_v^a.]

Design Methodology
• Communicate only stable predicates: once true, always true
• Structure the program as a set of atomic actions
• Make actions as non-deterministic as possible: weakest guards
    Allows more freedom for the implementation
    Makes it clear what is essential
• Separate safety, liveness, and performance
    Safety first, then strengthen guards for liveness and scheduling
• Abstraction functions and simulation proofs

Notation
Subscripts and superscripts for function arguments: r_v^a for r(v, a)
State functions used like variables
Actions described like this:
  Name      Guard                            State change
  Close_v   c_v = nil ∧ x ∈ anchor_v    →   c_v := x

Failure Model
A set M of processes (machines)
A faulty process can send arbitrary messages: F_m
A stopped process does nothing: S_m
A failed process is faulty or stopped. State freezes after failure.
Limits on failure:
  Z_F  = set of sets of processes that can all be faulty
  Z_S  = set of sets of processes that can all be stopped
  Z_FS = set of sets of processes that can all be failed
Examples:
  Fail-stop: n processes, Z_F = {}, Z_S = Z_FS = any set of size < (n+1)/2
  Byzantine: n processes, Z_F = Z_S = Z_FS = any set of size < (n+1)/3
  Intel-Microsoft: n_I + n_M processes, Z_F = any subset of one side

Quorums and Predicates
Quorum set Q: set of sets of processes; q in ⇒ any superset in.
State predicate g. Predicate on processes G, so G_m is a predicate.
A stable predicate once true remains true.
Q#G: a predicate G appears to hold in quorum Q: {m | G_m ∨ F_m} ∈ Q
  Shorthand: Q[r_v^* = x] for Q#(λ m | r_v^m = x)
A good quorum is not all faulty: Q_~F = {q | q ∉ Z_F}
Q_1 and Q_2 exclusive: Q_1 quorum for G ⇒ no Q_2 quorum for its negation.
  Means q_1 ∩ q_2 ∈ Q_~F for any q_1 and q_2. Example: size > (n + f)/2
  Lift local r_v^a = x ⇒ ~(r_v^a = out) to global Q_1[r_v^* = x] ⇒ ~Q_2[r_v^* = out]
Q^+: ensures Q even after failures: q^+ - z_FS ∈ Q for any q^+, z_FS
A live quorum has Q^+ ≠ {}
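
The quorum definitions can be checked by brute force on a small system. The following sketch assumes the Byzantine example with n = 4 and f = 1; the helper names `subsets`, `exclusive`, and `appears_to_hold` are hypothetical and only mirror the definitions on the slide.

```python
from itertools import combinations


def subsets(procs, size_ok):
    """All subsets of procs whose size satisfies size_ok, as frozensets."""
    return {frozenset(c)
            for k in range(len(procs) + 1) if size_ok(k)
            for c in combinations(procs, k)}


def exclusive(q1_set, q2_set, z_f):
    """Q1 and Q2 are exclusive if every q1 ∩ q2 is a good quorum (not in Z_F)."""
    return all((q1 & q2) not in z_f for q1 in q1_set for q2 in q2_set)


def appears_to_hold(quorum_set, holds, faulty):
    """Q#G: the processes where G holds, together with the faulty ones, form a quorum."""
    return (frozenset(holds) | frozenset(faulty)) in quorum_set


procs, n, f = "abcd", 4, 1
z_f = subsets(procs, lambda k: k < (n + 1) / 3)      # sets that can all be faulty
quorums = subsets(procs, lambda k: k > (n + f) / 2)  # size > (n + f)/2, here size >= 3
too_small = subsets(procs, lambda k: k >= 2)         # size-2 sets can miss each other

ok = exclusive(quorums, quorums, z_f)
not_ok = exclusive(too_small, too_small, z_f)
seen = appears_to_hold(quorums, holds={"a", "b"}, faulty={"d"})
unseen = appears_to_hold(quorums, holds={"a"}, faulty={"d"})
```

Any two quorums of size 3 out of 4 share at least two processes, so their intersection cannot be entirely faulty; two quorums of size 2 can be disjoint, which is why they are not exclusive.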

Specification for Consensus
type X = ...                          values to decide on
var  d     : (X ∪ {nil}) := nil       Decision
     input : set X := {}

  Name            Guard                      State change
  Input(x)                                   input := input ∪ {x}
  Decision : X    d ≠ nil               →    ret d
  Decide          d = nil ∧ x ∈ input   →    d := x
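
The specification is small enough to transcribe directly. In this sketch, `None` stands for nil (so `None` itself cannot be a value in X), and each guarded action is modeled as a method that reports whether its guard held; both choices are assumptions of the illustration, not part of the spec.

```python
class ConsensusSpec:
    """Direct transcription of the consensus spec; None plays the role of nil."""

    def __init__(self):
        self.d = None       # decision, initially nil
        self.input = set()  # values that have been input

    def Input(self, x):
        """No guard: input := input ∪ {x}."""
        self.input.add(x)

    def Decide(self, x):
        """Guard d = nil ∧ x ∈ input; returns True if the action fired."""
        if self.d is None and x in self.input:
            self.d = x
            return True
        return False

    def Decision(self):
        """Guard d ≠ nil; returns d, or None while the action is not enabled."""
        return self.d


spec = ConsensusSpec()
early = spec.Decide(7)   # not enabled: 7 has not been input yet
spec.Input(7)
spec.Input(8)
first = spec.Decide(8)   # enabled: picks 8
second = spec.Decide(7)  # not enabled: d is already set
```

The non-determinism of the spec shows up in the caller's freedom to try `Decide` with any input value; once one call fires, the decision never changes.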

The Idea of Paxos
A sequence of views; get a decision quorum in one of them.
Each view v chooses an anchored value c_v: equals any earlier decision.
If a quorum accepts the choice, decision!
Decision is irrevocable, may be invisible, but is any later view's choice.
Choice is changeable, must be visible to Anchor if there was a decision.
[Figure: processes and agents a over time, split into "view change" and "normal operation". Actions shown: Start, Input, Close^a, Anchor, Choose, Accept^a, Finish^a, STEP. Labels: INPUT, OUTPUT, Transmit, c_v, r_v^a.]

Abstract Paxos (AP): State
Agents' state (agents 1, 2, 3):   r_v^1, r_v^2, r_v^3   and   d^1, d^2, d^3
Non-local state:                  c_v, input, active_v
State function r_v:
  r_v =      if                    View is
  x          Q_dec[r_v^* = x]      decided x
  out        Q_out[r_v^* = out]    out
  nil        else                  open
Q_dec and Q_out exclusive.
var = const is stable for all these except input, and x ∈ input is stable.

AP: Data Flow
[Figure: data flow among the AP actions]
  Client INPUT(x)   produces   x ∈ input
  Choose_v          uses x ∈ input and x ∈ anchor_v;   c_v := x
  Accept_v^a        uses c_v;                          r_v^a := c_v
  Finish_v^a        uses r_v = c_v;                    d^a := r_v
  Close_v^a         where r_u^a = nil, r_u^a := out for u < v
  View change: results r_u for u < v flow to later views.

Example
Two runs of AP with agents a, b, c, two agents in a quorum, input = {7, 8, 9}

             Left run                     Right run
             c_v  r_v^a  r_v^b  r_v^c     c_v  r_v^a  r_v^b  r_v^c
  View 1     7    7      out    out       8    8      out    out
  View 2     8    out    8      out       9    9      out    9
  View 3     9    out    out    9         9    out    out    9

  input ∩ anchor_4:
    Left run:   = {7, 8, 9} seeing a, b, c
                ⊇ {8} seeing a, b
                ⊇ {9} seeing a, c or b, c
    Right run:  {9} no matter what quorum we see
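
The anchor values in this example can be recomputed mechanically. The sketch below applies the lower-bound rule from the Anchoring slide (scan earlier views from newest to oldest, skip views that are visibly out, and take the first non-out result seen); the function name `input_and_anchor` and the dictionary encoding of the two runs are assumptions made for illustration.

```python
OUT = "out"


def input_and_anchor(earlier_views, visible, quorum_size, inputs):
    """Lower bound on input ∩ anchor_v using only the visible agents.

    earlier_views[i][a] is r_u^a for the i-th earlier view, oldest first.
    """
    for results in reversed(earlier_views):
        seen = [results[a] for a in visible]
        if seen.count(OUT) >= quorum_size:
            continue  # this view is visibly out: look further back
        # Not visibly out: any value we see in it is the anchored choice.
        return {r for r in seen if r != OUT and r is not None}
    return set(inputs)  # every earlier view is out, so the anchor is all of X


left = [
    {"a": 7, "b": OUT, "c": OUT},
    {"a": OUT, "b": 8, "c": OUT},
    {"a": OUT, "b": OUT, "c": 9},
]
right = [
    {"a": 8, "b": OUT, "c": OUT},
    {"a": 9, "b": OUT, "c": 9},
    {"a": OUT, "b": OUT, "c": 9},
]
inputs = {7, 8, 9}

left_abc = input_and_anchor(left, "abc", 2, inputs)
left_ab = input_and_anchor(left, "ab", 2, inputs)
left_ac = input_and_anchor(left, "ac", 2, inputs)
right_all = [input_and_anchor(right, vis, 2, inputs) for vis in ("abc", "ab", "ac", "bc")]
```

In the left run no view reached a decision, so different quorums may legitimately anchor different values; in the right run view 2 decided 9, and every quorum is forced to anchor 9.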

Anchoring
invariant r_v = x ∧ r_u = x' ⇒ x = x'                                  all results agree
  =  ∀ x', u | r_v = x ∧ r_u = x' ⇒ x = x'                             assume u < v
  =  r_v = x ⇒ (∀ u < v, x' ≠ x | ~Q_dec[r_u^* = x'])
  ⇐  r_v = x ⇒ (∀ u < v | Q_out[r_u^* ∈ {x, out}])                     since r_u^a ∈ {x, out} ⇒ ~(r_u^a = x')

sfunc anchor_v = {x | (∀ u < v | Q_out[r_u^* ∈ {x, out}])}
  =  {x | (∀ w | v_0 ≤ w < u ⇒ Q_out[r_w^* ∈ {x, out}])}               (= anchor_u)
     ∩ {x | Q_out[r_u^* ∈ {x, out}]}
     ∩ {x | (∀ w | u < w < v ⇒ Q_out[r_w^* ∈ {x, out}])}               (= X if out_u,v)
  =  anchor_u ∩ {x | Q_out[r_u^* ∈ {x, out}]}                          if out_u,v
  ⊇  if out_v0,v then X elseif out_u,v ∧ r_u^a = x then {x} else {}
where out_u,v = (∀ w | u < w < v ⇒ r_w = out)

AP: Algorithm
  Name          Guard                                   State change
  Start_v       u < v too slow                     →    active_v := true
  Close_v^a     active_v                           →    for all u < v do
                                                          if r_u^a = nil then r_u^a := out
                                                        post: u < v ⇒ r_u^a ≠ nil
  Anchor_v      anchor_v ≠ {}                      →    no state change
  Choose_v      c_v = nil ∧ x ∈ input ∩ anchor_v   →    c_v := x
  Accept_v^a    r_v^a = nil ∧ c_v ≠ nil            →    r_v^a := c_v ; Close_v^a
  Finish_v^a    r_v ∈ X                            →    d^a := r_v

  For Anchor_v: anchor_v = anchor_u ∩ {x | Q_out[r_u^* ∈ {x, out}]} if out_u,v
[Figure: the AP data-flow diagram, repeated in small from the Data Flow slide]
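
The action table can be exercised with a toy model that keeps all of AP's state in one object. This is a sketch under strong assumptions: no messages, no faults, views numbered from 0, and a single `quorum_size` used for both Q_dec and Q_out. The class and method names follow the slide but are otherwise hypothetical, and `anchor` returns `None` to stand for all of X.

```python
OUT = "out"


class AbstractPaxos:
    def __init__(self, agents, quorum_size):
        self.agents = list(agents)
        self.q = quorum_size
        self.input = set()
        self.active = set()  # views v with active_v = true
        self.c = {}          # c_v; a missing key means nil
        self.r = {}          # r[(v, a)] = r_v^a; a missing key means nil
        self.d = {}          # d^a; a missing key means nil

    def result(self, v):
        """State function r_v: x, OUT, or None (nil)."""
        vals = [self.r.get((v, a)) for a in self.agents]
        for x in set(vals) - {None, OUT}:
            if vals.count(x) >= self.q:
                return x
        return OUT if vals.count(OUT) >= self.q else None

    def anchor(self, v):
        """{x | for all u < v, Q_out[r_u^* in {x, out}]}; None stands for all of X."""
        candidates = None
        for u in range(v):
            vals = [self.r.get((u, a)) for a in self.agents]
            if vals.count(OUT) >= self.q:
                continue  # every x satisfies the predicate for this view
            ok = {x for x in set(vals) - {None, OUT}
                  if sum(r in (x, OUT) for r in vals) >= self.q}
            candidates = ok if candidates is None else candidates & ok
        return candidates

    def _close_earlier(self, v, a):
        for u in range(v):
            self.r.setdefault((u, a), OUT)  # only writes where r_u^a = nil

    def Input(self, x):
        self.input.add(x)

    def Start(self, v):
        self.active.add(v)

    def Close(self, v, a):
        if v not in self.active:
            return False
        self._close_earlier(v, a)
        return True

    def Choose(self, v, x):
        anchor = self.anchor(v)
        anchored = anchor is None or x in anchor
        if v in self.c or x not in self.input or not anchored:
            return False
        self.c[v] = x
        return True

    def Accept(self, v, a):
        if (v, a) in self.r or v not in self.c:
            return False
        self.r[(v, a)] = self.c[v]
        self._close_earlier(v, a)
        return True

    def Finish(self, v, a):
        x = self.result(v)
        if x is None or x == OUT:
            return False
        self.d[a] = x
        return True


p = AbstractPaxos("abc", quorum_size=2)
p.Input(7); p.Input(8)
p.Start(0); p.Choose(0, 7); p.Accept(0, "a")        # view 0: only a accepts 7
p.Start(1); p.Close(1, "b"); p.Close(1, "c")        # b and c close view 0, so it is out
p.Choose(1, 8); p.Accept(1, "b"); p.Accept(1, "c")  # view 1 decides 8
p.Finish(1, "b")
p.Start(2); p.Close(2, "a")                         # a closes view 1
```

After this run, `p.anchor(2)` contains only 8, so `p.Choose(2, 7)` is refused while `p.Choose(2, 8)` succeeds: a later view can only choose the value already decided.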

AP: Liveness
Choose_v must see an element of input ∩ anchor_v. Recall
  anchor_v = anchor_u ∩ {x | Q_out[r_u^* ∈ {x, out}]}    if out_u,v
           ⊇ if out_v0,v then X elseif out_u,v ∧ r_u^a = x then {x} else {}
After Close_v^a, an OK agent a has r_u^a ≠ nil for all u < v.
So if Q_out is live, we see either that every u < v is out, or r_u^a = x for some OK a.
But r_u^a = c_u ∈ input ∩ anchor_u
If we know a is OK, then r_u^a is what we want.
With faults (in BP), we might not know. But if anchor_u is visible, that is enough.
Still not live if new views start too fast.

Optimizations
Fixed-size agent state:
  [Figure: view axis marked v_0, vX_last^a, v_last^a, showing over which ranges of views w the result r_w^a is don't know, x_last^a, out, or nil]
Successive steps:
  Because anchor_v doesn't depend on input, can compute it for lots of steps at once.
  This is called a view change.
  One view change is enough for any number of steps.
Can batch steps, with one Paxos/batch.
Can run steps in parallel, subject to external consistency.

Disk Paxos (DP)
The goal: replace the conditional writes in Close and Accept with simple writes.
  AP:  Accept_v^a   r_v^a = nil ∧ c_v ≠ nil   →   r_v^a := c_v ; Close_v^a
The idea: replace r_v^a with rx_v^a and ro_v^a.
  DP:  Accept_v^a   c_v ≠ nil                 →   rx_v^a := c_v ; Close_v^a
       Close_v^a    active_v                  →   for all u < v do ro_u^a := out
Proof: keep r_v^a as a history variable. Abstract it to AP's r_v^a.
This invariant makes it work (sometimes with an extra view):
  rx_v^a =   ∧  ro_v^a =   ⇒   r_v^a
  nil           nil             = nil
  nil           out             = out
  x             nil             = x
  x             out             ≠ nil
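
The invariant table can be checked on a tiny model of one agent's state for one view. This is a sketch, not the DP algorithm itself: `DiskAgentView` and `invariant_holds` are hypothetical names, `None` stands for nil, and the history variable `r` is updated the way AP's conditional writes would update it.

```python
OUT = "out"


class DiskAgentView:
    """rx and ro are written unconditionally; r is AP's result kept as a history variable."""

    def __init__(self):
        self.rx = None
        self.ro = None
        self.r = None

    def accept(self, c_v):
        if c_v is None:  # guard: c_v ≠ nil
            return
        self.rx = c_v    # simple write, no test of the old value
        if self.r is None:
            self.r = c_v  # history variable follows AP's conditional write

    def close(self):
        self.ro = OUT    # simple write
        if self.r is None:
            self.r = OUT


def invariant_holds(s):
    """The four rows of the invariant table relating (rx, ro) to r."""
    if s.rx is None and s.ro is None:
        return s.r is None
    if s.rx is None:
        return s.r == OUT
    if s.ro is None:
        return s.r == s.rx
    return s.r is not None  # both written: r is whichever write came first


accept_first = DiskAgentView()
accept_first.accept(9)
accept_first.close()

close_first = DiskAgentView()
close_first.close()
close_first.accept(9)
```

Both orders end with rx = 9 and ro = out, but the history variable differs (9 in one run, out in the other), which is why the last row of the table can only promise r ≠ nil.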