Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering (PowerPoint Presentation)


  1. Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering. Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports

  2-5. Server failures are the common case in data centers

  6-8. State Machine Replication (diagram): the same sequence of operations A, B, C is applied at every replica.

  9-11. Paxos for state machine replication (diagram): the client sends a request to the leader; the leader sends prepare messages to the other replicas, collects prepareok responses, and then replies to the client. The leader is a throughput bottleneck, and the extra coordination round adds a latency penalty.

  12. Can we eliminate Paxos overhead? The performance overhead comes from worst-case network assumptions: • valid assumptions for the Internet • data center networks are different. What properties should the network have to enable faster replication?

  13. Network properties determine replication complexity. Asynchronous network: messages may be dropped, reordered, or delivered with arbitrary latency. Consequence: the Paxos protocol must run on every operation, at a high performance cost.

  14-16. Network properties determine replication complexity. Reliable, ordered network: all replicas receive the same set of messages, in the same order. Consequence: replication becomes trivial, but a network implementation of these guarantees has the same complexity as Paxos.

  17-18. (diagram) Network guarantees span a spectrum from weak (an asynchronous network, requiring Paxos) to strong (reliable, ordered delivery).

  19. Can we build a network model that • provides performance benefits • can be implemented more efficiently?

  20-25. This Talk: a new network model with a near-zero-cost implementation, Ordered Unreliable Multicast, + a coordination-free replication protocol, Network-Ordered Paxos, = replication within 2% throughput overhead.

  26. Outline 1. Background on state machine replication and data center networks 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

  27. Towards an ordered but unreliable network. Key idea: separate ordering from reliable delivery in state machine replication; the network provides ordering, while the replication protocol handles reliability.

  28. OUM Approach: designate one sequencer in the network; the sequencer maintains a counter for each OUM group. 1. Forward OUM messages to the sequencer. 2. The sequencer increments its counter and writes the counter value into the packet header. 3. Receivers use sequence numbers to detect reordering and message drops.
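
A minimal Python sketch of the per-group counter logic described on slide 28, for illustration only: the real sequencer runs in a programmable switch or middlebox, and the names here (Sequencer, stamp, "oum_seq") are assumptions rather than taken from the paper.

    # Illustrative OUM sequencer logic (slide 28); names are assumptions, and the
    # paper's sequencer runs in a switch or middlebox, not in Python.
    class Sequencer:
        def __init__(self):
            self.counters = {}  # one monotonically increasing counter per OUM group

        def stamp(self, group_id, packet):
            # Increment the group's counter and write its value into the packet header.
            seq = self.counters.get(group_id, 0) + 1
            self.counters[group_id] = seq
            packet["oum_seq"] = seq  # receivers use this to detect drops and reordering
            return packet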

  29-32. Ordered Unreliable Multicast (diagram): senders forward messages through the sequencer, whose counter stamps them 1, 2, 3, 4; one stamped message is dropped before reaching a receiver.

  33-34. Ordered multicast: no coordination required to determine the order of messages. Drop detection: coordination only required when messages are dropped.
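
A companion sketch, assuming the "oum_seq" header stamped by the sequencer sketch above, of how a receiver detects reordering and drops; the class name and callbacks are illustrative, and actual recovery is left to the replication protocol (NOPaxos).

    # Illustrative receiver-side drop detection (slides 29-34).
    class OumReceiver:
        def __init__(self, deliver, on_drop):
            self.next_seq = 1       # next sequence number expected from the sequencer
            self.deliver = deliver  # callback for in-order messages
            self.on_drop = on_drop  # callback for detected gaps (handled by NOPaxos)

        def receive(self, packet):
            seq = packet["oum_seq"]
            if seq == self.next_seq:
                self.deliver(packet)
                self.next_seq += 1
            elif seq > self.next_seq:
                # Gap: sequence numbers next_seq .. seq-1 were dropped by the network;
                # the replication protocol, not the network, decides how to fill them.
                self.on_drop(list(range(self.next_seq, seq)), packet)
            # seq < next_seq: duplicate or stale packet, ignore it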

  35-37. Sequencer Implementations: • In-switch sequencing: next-generation programmable switches, implemented in P4, nearly zero cost. • Middlebox prototype: Cavium Octeon network processor, connects to the root switches, adds 8 us of latency. • End-host sequencing: no specialized hardware required, incurs higher latency penalties, similar throughput benefits.

  38. Outline 1. Background on state machine replication and data center networks 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

  39. NOPaxos Overview • Built on top of the guarantees of OUM • Client requests are totally ordered but can be dropped • No coordination in the common case • Replicas run agreement on drop detection • View change protocol for leader or sequencer failure

  40-45. Normal Operation (diagram): the client sends its request through OUM to all replicas; the leader executes it and replies; the client waits for replies from a majority, including the leader's. No coordination among replicas, and the request completes in one round-trip time.
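
A sketch of the client-side check implied by slides 40-45: a request completes once replies from a majority of replicas, including the leader's, have arrived. The reply fields used here are assumptions for illustration, not the paper's wire format.

    # Illustrative client-side quorum check for NOPaxos normal operation.
    # Assumes each reply is a dict with "is_leader" and "result" fields;
    # a full client would also check that replies agree on the view and log slot.
    def quorum_result(replies, num_replicas):
        majority = num_replicas // 2 + 1
        leader_reply = next((r for r in replies if r["is_leader"]), None)
        if leader_reply is not None and len(replies) >= majority:
            return leader_reply["result"]  # only the leader executed the request
        return None  # keep waiting for more replies (or retry after a timeout)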

  46. Gap Agreement Replicas detect message drops • Non-leader replicas: recover the missing message from the leader • Leader replica: coordinates to commit a NO-OP (Paxos) • Efficient recovery from network anomalies
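
A sketch of the two reactions slide 46 describes when a replica detects a dropped message. The message type and helper names (GapRequest, commit_noop) are assumptions, not the protocol's exact messages.

    # Illustrative reaction to a detected drop (slide 46).
    def handle_gap(replica, missing_seq):
        if not replica.is_leader:
            # Non-leader: recover the missing message's contents from the leader.
            replica.send(replica.leader, {"type": "GapRequest", "seq": missing_seq})
        else:
            # Leader: if it also missed the message, it coordinates a Paxos-style
            # round to commit a NO-OP in that slot, so every replica agrees the
            # slot holds no client operation.
            replica.commit_noop(missing_seq)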

  47. View Change • Handles leader or sequencer failure • Ensures that all replicas are in a consistent state • Runs a view change protocol similar to VR • view-number is a tuple of <leader-number, session-number>
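
A small sketch of the view number described on slide 47, a <leader-number, session-number> tuple. The update rule below is an assumption inferred from the slide (leader failure vs. sequencer failure); it is not the full view change protocol.

    # Illustrative view number: a <leader-number, session-number> tuple (slide 47).
    from collections import namedtuple

    ViewId = namedtuple("ViewId", ["leader_num", "session_num"])

    def next_view(view, leader_failed=False, sequencer_failed=False):
        # Assumption: a leader failure advances leader_num, and a sequencer
        # (OUM session) failover advances session_num.
        return ViewId(
            view.leader_num + (1 if leader_failed else 0),
            view.session_num + (1 if sequencer_failed else 0),
        )

    # Example: from ViewId(0, 0), a sequencer failover yields ViewId(0, 1).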

  48. Outline 1. Background on state machine replication and data center networks 2. Ordered Unreliable Multicast 3. Network-Ordered Paxos 4. Evaluation

  49. Evaluation Setup • 3-level fat-tree network testbed • 5 replicas with 2.5 GHz Intel Xeon E5-2680 processors • middlebox sequencer (diagram)

  50-54. NOPaxos achieves better throughput and latency (graph: latency in us vs. throughput in ops/sec; lower latency and higher throughput are better). Against Paxos, Fast Paxos, and Paxos with batching, NOPaxos achieves 4.7x the throughput of Paxos and more than a 40% reduction in latency.
