The ABCDs of Paxos



  1. The ABCDs of Paxos
     Replicated state machines
     Consensus: a set of processes decide on an input value
     Paxos asynchronous consensus algorithm
       AP, Abstract Paxos: generic, non-local version
       CP, Classic Paxos: stopping failures, compare-and-swap (1989: Lamport, Liskov and Oki)
       DP, Disk Paxos: stopping failures, read-write (1999: Gafni and Lamport)
       BP, Byzantine Paxos: arbitrary failures (1999: Castro and Liskov)
     The paper is at research.microsoft.com/lampson
     Butler Lampson, ABCDs of Paxos: PODC 2001

  2. Replicated State Machines
     Lamport 1978: Time, clocks and the ordering of events …
     Cast your problem as a deterministic state machine that:
       Takes client input requests for state transitions, called steps
       Performs the steps
       Returns the output to the client
     Make n copies or ‘replicas’ of the state machine.
     Use consensus to feed all the replicas the same inputs.
     Steps must be deterministic, local to the replica, and atomic (use transactions).
     Recover by replaying the steps (like transactions).
     Even a read needs a step, unless the result is “as of step n”.
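     The replay idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the talk's code: `KeyCounter`, `Replica` and `catch_up` are hypothetical names, and the agreed sequence of steps is modelled as a plain list standing in for the output of consensus.

```python
class KeyCounter:
    """A deterministic state machine: the state is a dict of named counters."""

    def __init__(self):
        self.state = {}

    def step(self, request):
        """Perform one step (key, delta) and return the output for the client."""
        key, delta = request
        self.state[key] = self.state.get(key, 0) + delta
        return self.state[key]


class Replica:
    """One copy of the state machine plus a count of the steps it has performed."""

    def __init__(self):
        self.machine = KeyCounter()
        self.applied = 0

    def catch_up(self, log):
        """Perform every agreed step not yet applied; recovery is the same replay."""
        outputs = []
        for request in log[self.applied:]:
            outputs.append(self.machine.step(request))
            self.applied += 1
        return outputs


# Assumption: `log` is the sequence of steps that consensus has already agreed on.
log = [("x", 1), ("y", 5), ("x", 2)]
replicas = [Replica() for _ in range(3)]
for replica in replicas:
    replica.catch_up(log)

# A replica that lost its volatile state recovers by replaying the same steps.
recovered = Replica()
recovered_outputs = recovered.catch_up(log)
```

     Because every step is deterministic, any replica that performs the same prefix of the log ends in the same state.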

  3. Applications of RSM
     Reliable, available data storage system
     Airplane flight control
     Reflexive: changing quorums of the consensus algorithm
     Issuing a lease: a lock on part of the state that times out, hence is fault tolerant
       Leaseholder can work on its state without consensus
       Like any lock, a lease can have modes or be hierarchical
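     A lease, as described above, is a lock with an expiry time. The sketch below is only an illustration under stated assumptions: `Lease` and `grant` are hypothetical names, times are plain numbers supplied by the caller, and in a real replicated state machine each `grant` would itself be a step agreed by consensus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    holder: str
    expires_at: float

    def valid(self, now: float) -> bool:
        """The lease protects its holder only until it times out."""
        return now < self.expires_at


def grant(current: Optional[Lease], holder: str, now: float, duration: float) -> Lease:
    """Return the lease in force after `holder` asks for it at time `now`.

    A valid lease held by someone else is left untouched; an expired one
    can be taken over, which is what makes the lock fault tolerant.
    """
    if current is not None and current.valid(now) and current.holder != holder:
        return current
    return Lease(holder, now + duration)


first = grant(None, "p1", now=0.0, duration=10.0)
refused = grant(first, "p2", now=5.0, duration=10.0)     # p1 still holds the lease
taken_over = grant(first, "p2", now=10.0, duration=10.0)  # p1's lease has expired
```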

  4. The Idea of Paxos
     A sequence of views; get a decision quorum in one of them.
     Each view $v$ chooses an anchored value $c_v$: equals any earlier decision.
     If a quorum accepts the choice, decision!
     Decision is irrevocable, may be invisible, but is any later view’s choice.
     Choice is changeable, must be visible.
     [Figure: messages transmitted between processes (c and the agents a) for the actions Start, Input, Close, Anchor, Choose, Accept, Finish and Step, showing a view change followed by normal operation.]

  5. Design Methodology
     • Communicate only stable predicates: once true always true
     • Structure program as a set of atomic actions
     • Make actions as non-deterministic as possible: weakest guards
         Allows more freedom for the implementation
         Makes it clear what is essential
     • Separate safety, liveness, and performance
         Safety first, then strengthen guards for liveness and scheduling
     • Abstraction functions and simulation proofs

  6. Notation
     Subscripts and superscripts for function arguments: $r_v^a$ for $r(v, a)$
     State functions used like variables
     Actions described like this (Name: Guard → State change):
       $\mathit{Close}_v$: $c_v = \mathrm{nil} \land x \in \mathit{anchor}_v \;\to\; c_v := x$

  7. Failure Model
     A set $M$ of processes (machines)
     A faulty process can send arbitrary messages: $F_m$
     A stopped process does nothing: $S_m$
     A failed process is faulty or stopped.
     Failure doesn’t lose state.
     Limits on failure:
       $Z_F$ = set of sets of processes that can all be faulty
       $Z_S$ = set of sets of processes that can all be stopped
       $Z_{FS}$ = set of sets of processes that can all be failed
     Examples:
       Fail-stop: $n$ processes, $Z_F = \{\}$, $Z_S = Z_{FS}$ = any set of size $< (n+1)/2$
       Byzantine: $n$ processes, $Z_F = Z_S = Z_{FS}$ = any set of size $< (n+1)/3$
       Intel-Microsoft: $n_I + n_M$ processes, $Z_F$ = any subset of one side

  8. Quorums and Predicates
     Quorum: monotonic set of sets of processes: $q$ in ⇒ any superset in.
     Predicates $g$. Predicates on processes $G$, so $G_m$ is a predicate.
     A stable predicate once true remains true.
     A predicate $G$ holds in a quorum $Q$: $Q\#G = \{ m \mid G_m \lor F_m \} \in Q$
     Shorthand: $Q[r_v^* = x]$ for $Q\#(\lambda m \mid r_v^m = x)$.
     A good quorum is not all faulty: $Q_{\sim F} = \{ q \mid q \notin Z_F \}$
     $Q$ and $Q'$ exclusive: $Q$ quorum for $G$ ⇒ no $Q'$ quorum for its negation.
       Means $q \cap q' \in Q_{\sim F}$ for any two quorums. Ex: size $> (n+f)/2$
       Lifts local exclusion $G_1 \Rightarrow \lnot G_2$ to global: $Q\#G_1 \Rightarrow \lnot Q'\#G_2$
     $Q^+$: ensures $Q$ even after failures: $q^+ - z_{FS} \in Q$ for any $q^+$, $z_{FS}$
     A live quorum has $Q^+ \ne \{\}$
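     The definitions on this slide can be checked by brute force for small systems. The Python sketch below is an illustration only: the function names are hypothetical, quorums are modelled as size thresholds, and it assumes $Z_F$ is "every set of at most f processes", so an intersection is good exactly when it has more than f members.

```python
from itertools import combinations


def size_quorums(processes, threshold):
    """All subsets with more than `threshold` members (monotonic by construction)."""
    procs = sorted(processes)
    return {
        frozenset(combo)
        for k in range(len(procs) + 1)
        if k > threshold
        for combo in combinations(procs, k)
    }


def holds_in(quorums, G, faulty, processes):
    """Q#G: the processes where G holds, together with the faulty ones, form a quorum."""
    return frozenset(m for m in processes if G(m) or m in faulty) in quorums


def exclusive(Q1, Q2, f):
    """Every pairwise intersection is good, i.e. has more than f members."""
    return all(len(q1 & q2) > f for q1 in Q1 for q2 in Q2)


processes = range(5)
n, f = 5, 1
byzantine_quorums = size_quorums(processes, (n + f) / 2)  # sizes 4 and 5
majority_quorums = size_quorums(processes, n / 2)         # sizes 3, 4 and 5

accepted = {0, 1, 2}
with_fault = holds_in(byzantine_quorums, lambda m: m in accepted, {3}, processes)
without_fault = holds_in(byzantine_quorums, lambda m: m in accepted, set(), processes)
```

     With one possibly faulty process, plain majorities of five are not exclusive, because two of them can meet in a single process, while quorums of size greater than $(n+f)/2$ are.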

  9. Specification
     type $X$ = ... values to decide on
     var  $d : (X \cup \{\mathrm{nil}\}) := \mathrm{nil}$   (Decision)
          $\mathit{input} : \text{set } X := \{\}$
     Actions (Name: Guard → State change):
       $\mathit{Input}(x)$: (no guard) → $\mathit{input} := \mathit{input} \cup \{x\}$
       $\mathit{Decision} : X$: $d \ne \mathrm{nil}$ → ret $d$
       $\mathit{Decide}$: $d = \mathrm{nil} \land x \in \mathit{input}$ → $d := x$
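     The specification is small enough to transcribe directly. The following Python class is a sketch of it, with `None` standing in for nil and each action returning whether its guard held; the class and method names are illustrative, not from the talk.

```python
class ConsensusSpec:
    """The consensus specification: a decision d and a set of input values."""

    def __init__(self):
        self.d = None        # nil until some input value is decided
        self.input = set()

    def input_value(self, x):
        """Input(x): no guard; add x to the set of inputs."""
        self.input.add(x)

    def decision(self):
        """Decision: guarded by d != nil; returns None while the guard is false."""
        return self.d

    def decide(self, x):
        """Decide: guarded by d = nil and x in input; returns True if it fired."""
        if self.d is None and x in self.input:
            self.d = x
            return True
        return False


spec = ConsensusSpec()
early = spec.decide(7)          # guard false: 7 has not been input yet
spec.input_value(7)
spec.input_value(8)
fired = spec.decide(8)          # guard true: decides 8
late = spec.decide(7)           # guard false: the decision is irrevocable
```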

  10. The Idea of Paxos (slide 4, repeated)

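  11. Abstract Paxos (AP): State
      Non-local state: $c_v$, $\mathit{input}$, $\mathit{active}_v$
      Agent state, for agents $a$ = 1, 2, 3: $r_v^a$ and $d_a$
      State function $r_v$ and the resulting state of the view:
        $r_v = x$ if $Q_{\mathit{dec}}[r_v^* = x]$: view is decided
        $r_v = \mathrm{out}$ if $Q_{\mathit{out}}[r_v^* = \mathrm{out}]$: view is out
        $r_v = \mathrm{nil}$ otherwise: view is open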

  12. AP: Data Flow
      Client: INPUT $x$, so $x \in \mathit{input}$
      $\mathit{Choose}_v$: $x \in \mathit{anchor}_v$; $c_v := x$
      $\mathit{Accept}_v^a$: $r_v^a := c_v$; with a quorum, $r_v = c_v$
      $\mathit{Finish}_v^a$: $d_a := r_v$, so $d_a = r_v$
      $\mathit{Close}_v^a$: for $u < v$, $r_u^a = \mathrm{nil}$ becomes $r_u^a := \mathrm{out}$; $r_u$ flows to later views
      Each value is nil or = the previous one
      [Figure: messages transmitted between processes (c and the agents a) for the actions Start, Input, Close, Anchor, Choose, Accept, Finish and Step, showing a view change followed by normal operation.]

  13. Example
      Two runs of AP with agents a, b, c, two agents in a quorum, input = {7, 8, 9}

                  First run                              Second run
                  $c_v$  $r_v^a$  $r_v^b$  $r_v^c$       $c_v$  $r_v^a$  $r_v^b$  $r_v^c$
      View 1      7      7        out      out           8      8        out      out
      View 2      8      8        out      out           9      9        out      9
      View 3      9      out      out      9             9      out      out      9

      $\mathit{input} \cap \mathit{anchor}_4$:
        First run: = {7, 8, 9} seeing a, b, c; ⊇ {8} seeing a, b; ⊇ {9} seeing a, c or b, c
        Second run: {9} no matter what quorum we see

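  14. Anchoring
      invariant $r_v = x \land r_u = x' \Rightarrow x = x'$   (all results agree)
      $= \forall x', u \mid r_v = x \land r_u = x' \Rightarrow x = x'$
      $= r_v = x \Rightarrow (\forall u < v,\ x' \ne x \mid \lnot Q_{\mathit{dec}}[r_u^* = x'])$   (assume $u < v$)
      $\Leftarrow r_v = x \Rightarrow (\forall u < v \mid c_u = x \lor Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}])$   (since $r_u^a \in \{x, \mathrm{out}\} \Rightarrow \lnot(r_u^a = x')$)

      sfunc $\mathit{anchor}_v = \{ x \mid (\forall u < v \mid c_u = x \lor Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}]) \}$
      $= \{ x \mid (\forall w \mid v_0 \le w < u \Rightarrow c_w = x \lor Q_{\mathit{out}}[r_w^* \in \{x, \mathrm{out}\}]) \}$   ($= \mathit{anchor}_u$)
        $\cap\ \{ x \mid c_u = x \lor Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}] \}$
        $\cap\ \{ x \mid (\forall w \mid u < w < v \Rightarrow c_w = x \lor Q_{\mathit{out}}[r_w^* \in \{x, \mathrm{out}\}]) \}$   ($= X$ if $\mathit{out}_{u,v}$)
      $= \{ x \mid c_u = x \} \cup (\mathit{anchor}_u \cap \{ x \mid Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}] \})$   (if $\mathit{out}_{u,v}$, since $c_u \in \mathit{anchor}_u$)
      $\supseteq$ if $\mathit{out}_{u,v} \land r_u^a = x$ then $\{x\}$ elseif $\mathit{out}_{v_0,v}$ then $X$ else $\{\}$
      where $\mathit{out}_{u,v} = (\forall w \mid u < w < v \Rightarrow r_w = \mathrm{out})$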
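      The final "if ... then ... elseif ... else" bound can be evaluated mechanically from whatever agent state is visible. The Python sketch below does that for the two runs on the Example slide (slide 13). It is an illustration under stated assumptions: `anchor_bound` and `view_is_out` are hypothetical names, a view counts as out when at least `quorum_size` of the seen agents report out, and the all-out case is checked first so that the result is the exact anchor whenever that is visible.

```python
OUT = "out"
NIL = None


def view_is_out(results, seen, quorum_size):
    """True if the seen agents alone show an out-quorum for this view."""
    return sum(1 for a in seen if results.get(a) == OUT) >= quorum_size


def anchor_bound(history, seen, quorum_size, all_values):
    """Values known to be in the next view's anchor, given the agents in `seen`.

    `history` lists, earliest view first, a dict agent -> r_u^a for each earlier view.
    """
    # elseif branch: every earlier view is out, so the anchor is all of X.
    if all(view_is_out(results, seen, quorum_size) for results in history):
        return set(all_values)
    # if branch: find the latest view u with a visible value and all later views out.
    for results in reversed(history):
        values = {results[a] for a in seen if results.get(a) not in (OUT, NIL)}
        if values:
            return values
        if not view_is_out(results, seen, quorum_size):
            break
    return set()  # else branch: nothing can be concluded yet


X = {7, 8, 9}
first_run = [
    {"a": 7, "b": OUT, "c": OUT},
    {"a": 8, "b": OUT, "c": OUT},
    {"a": OUT, "b": OUT, "c": 9},
]
second_run = [
    {"a": 8, "b": OUT, "c": OUT},
    {"a": 9, "b": OUT, "c": 9},
    {"a": OUT, "b": OUT, "c": 9},
]
seen_ab = anchor_bound(first_run, {"a", "b"}, 2, X)
seen_abc = anchor_bound(first_run, {"a", "b", "c"}, 2, X)
```

      Here `seen_ab` is `{8}` and `seen_abc` is `{7, 8, 9}`, matching the first run on the Example slide; every two- or three-agent subset of the second run gives `{9}`.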

  15. AP: Algorithm
      Actions (Name: Guard → State change):
        $\mathit{Start}_v$: $u < v$ too slow → $\mathit{active}_v := \mathrm{true}$
        $\mathit{Close}_v^a$: $\mathit{active}_v$ → for all $u < v$ do if $r_u^a = \mathrm{nil}$ then $r_u^a := \mathrm{out}$
          post: $u < v \land \mathit{Close}_v^a \Rightarrow r_u^a \ne \mathrm{nil}$
        $\mathit{Anchor}_v$: $\mathit{anchor}_v \ne \{\}$ → no state change
          $\mathit{anchor}_v = \{ x \mid c_u = x \} \cup (\mathit{anchor}_u \cap \{ x \mid Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}] \})$ if $\mathit{out}_{u,v}$
        $\mathit{Choose}_v$: $c_v = \mathrm{nil} \land x \in \mathit{input} \cap \mathit{anchor}_v$ → $c_v := x$
        $\mathit{Accept}_v^a$: $r_v^a = \mathrm{nil} \land c_v \ne \mathrm{nil}$ → $r_v^a := c_v$; $\mathit{Close}_v^a$
        $\mathit{Finish}_v^a$: $r_v \in X$ → $d_a := r_v$
      [Figure: the AP data-flow diagram from slide 12, repeated.]
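      The actions above can be exercised in a small simulation. The Python class below is a sketch, not the talk's code: it reads the global state directly (which the non-local AP is allowed to do), uses simple majorities for both the decision and out quorums, computes the anchor from its defining formula on the Anchoring slide rather than incrementally, and models no failures. All class and method names are illustrative.

```python
OUT = "out"


class AbstractPaxos:
    def __init__(self, agents, values):
        self.agents = list(agents)
        self.X = set(values)
        self.input = set()
        self.active = set()   # views that have been started
        self.c = {}           # view -> chosen value c_v (missing means nil)
        self.r = {}           # (view, agent) -> value or OUT (missing means nil)
        self.d = {}           # agent -> decision d_a
        self.quorum = len(self.agents) // 2 + 1

    def _count(self, v, allowed):
        return sum(1 for a in self.agents if self.r.get((v, a)) in allowed)

    def anchor(self, v):
        """{x | for all u < v: c_u = x or an out-quorum has r_u^a in {x, out}}."""
        return {x for x in self.X
                if all(self.c.get(u) == x or self._count(u, (x, OUT)) >= self.quorum
                       for u in range(1, v))}

    def result(self, v):
        """r_v: the value accepted by a decision quorum in view v, else None."""
        for x in self.X:
            if self._count(v, (x,)) >= self.quorum:
                return x
        return None

    def start(self, v):
        self.active.add(v)

    def close(self, v, a):
        if v in self.active:
            for u in range(1, v):
                self.r.setdefault((u, a), OUT)   # only nil entries become out

    def choose(self, v, x):
        if v not in self.c and x in self.input & self.anchor(v):
            self.c[v] = x

    def accept(self, v, a):
        if (v, a) not in self.r and v in self.c:
            self.r[(v, a)] = self.c[v]
            self.close(v, a)

    def finish(self, v, a):
        x = self.result(v)
        if x is not None:
            self.d[a] = x


ap = AbstractPaxos("abc", {7, 8, 9})
ap.input |= {7, 8}
ap.start(1); ap.choose(1, 7); ap.accept(1, "a")       # view 1 stalls with one accept
ap.start(2); ap.close(2, "b"); ap.close(2, "c")       # b and c close view 1
ap.choose(2, 8); ap.accept(2, "b"); ap.accept(2, "c"); ap.finish(2, "a")
ap.start(3); ap.close(3, "a"); ap.choose(3, 7); ap.choose(3, 8)
```

      In this run view 1 ends up out, view 2 decides 8, and view 3 can then only choose 8: the attempt to choose 7 fails its guard because 7 is not in the anchor of view 3.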

  16. AP: Liveness
      Choose must see an element of $\mathit{input} \cap \mathit{anchor}_v$.
      Recall $\mathit{anchor}_v = \{ x \mid c_u = x \} \cup (\mathit{anchor}_u \cap \{ x \mid Q_{\mathit{out}}[r_u^* \in \{x, \mathrm{out}\}] \})$
        $\supseteq$ if $\mathit{out}_{u,v} \land r_u^a = x$ then $\{x\}$ elseif $\mathit{out}_{v_0,v}$ then $X$ else $\{\}$
      After $\mathit{Close}_v^a$, an OK agent $a$ has $r_u^a \ne \mathrm{nil}$ for all $u < v$.
      So if $Q_{\mathit{out}}$ is live, we see either $u < v$ is out, or $r_u^a = x$ for some OK $a$.
      If we know $a$ is OK, then $r_u^a$ is what we want: $r_u^a = c_u \in \mathit{input} \cap \mathit{anchor}_u$.
      With faults (in BP), we might not know. But if $\mathit{anchor}_u$ is visible, that is enough.

  17. Optimizations
      Fixed-size agent state:
        [Figure: a timeline of views $w$ from $v_0$ through $vX_{\mathit{last}}^a$ to $v_{\mathit{last}}^a$, with $r_w^a$ labelled don’t know, $x_{\mathit{last}}$, out and nil in successive ranges.]
      Successive steps:
        Because $\mathit{anchor}_v$ doesn’t depend on $\mathit{input}$, can compute it for lots of steps at once.
        This is called a view change.
        One view change is enough for any number of steps.
      Can batch steps with one Paxos/batch.
      Can run steps in parallel, subject to external consistency.

  18. Disk Paxos (DP)
      The goal: replace the conditional writes in Close and Accept with simple writes.
        AP: $\mathit{Accept}_v^a$: $r_v^a = \mathrm{nil} \land c_v \ne \mathrm{nil}$ → $r_v^a := c_v$; $\mathit{Close}_v^a$
      The idea: replace $r_v^a$ with $rx_v^a$ and $ro_v^a$.
        $\mathit{Accept}_v^a$: $c_v \ne \mathrm{nil}$ → $rx_v^a := c_v$; $\mathit{Close}_v^a$
        $\mathit{Close}_v^a$: $\mathit{active}_v$ → for all $u < v$ do $ro_u^a := \mathrm{out}$
      Proof: keep $r_v^a$ as a history variable. Abstract it to AP’s $r_v^a$.
      This invariant makes it work (sometimes with an extra view):

        $rx_v^a$    $ro_v^a$    ⇒  $r_v^a$
        nil         nil            = nil
        nil         out            = out
        $x$         nil            = $x$
        $x$         out            ≠ nil
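      The invariant table relating DP's two write-only variables to AP's single result can be written as a predicate. This Python sketch is illustrative only: `dp_invariant_holds` is a hypothetical name, and nil and out are modelled as the strings "nil" and "out".

```python
NIL, OUT = "nil", "out"


def dp_invariant_holds(rx, ro, r):
    """Check one agent and view: does history variable r agree with (rx, ro)?

    rx is nil or an accepted value x; ro is nil or out.
    """
    if rx == NIL and ro == NIL:
        return r == NIL
    if rx == NIL and ro == OUT:
        return r == OUT
    if ro == NIL:
        return r == rx       # rx = x, ro = nil: the result is x
    return r != NIL          # rx = x, ro = out: only known to be non-nil


rows = [
    dp_invariant_holds(NIL, NIL, NIL),
    dp_invariant_holds(NIL, OUT, OUT),
    dp_invariant_holds(7, NIL, 7),
    dp_invariant_holds(7, OUT, 7),
    dp_invariant_holds(7, OUT, OUT),
]
```

      The last row is the only one that leaves the result open: when both writes have happened, $r_v^a$ may be either $x$ or out depending on which came first, which is why the slide notes that an extra view is sometimes needed.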
