Fault Tolerance at Speed


  1. StoneTor Fault Tolerance at Speed Todd L. Montgomery @toddlmontgomery

  2. About me…

  3. What type of Fault Tolerance? What is Clustering? Why Aeron? Design for Speeding Up?

  4. What type of Fault Tolerance? What is Clustering? Why Aeron? Design for Speeding Up? Efficiency

  5. https://www.nature.com/articles/d41586-018-06610-y https://www.forbes.com/sites/forbestechcouncil/2017/12/15/why-energy-is-a-big-and-rapidly-growing-problem-for-data-centers/#344456665a30 https://www.datacenterdynamics.com/opinions/power-consumption-data-centers-global-problem/

  6. We seem to assume efficiency/security/quality/etc. is a “special” characteristic added … later… if at all

  7. Fault Tolerance

  8. [Diagram: a Client connected to a single Service]

  9. [Diagram: a Client and a Service, again]

  10. [Diagram: a Client connected to three Services]

  11. [Diagram: three Clients connected to three Services]

  12. [Diagram: three Clients and three Services, with State held inside each Service]

  13. [Diagram: three Services backed by State "Storage"]

  14. [Diagram: three Clients and three Services, with State held inside each Service]

  15. Fault Tolerance of State

  16. Partition / Replication [Diagram: State distributed across three Services]

  17. Contiguous Log with Snapshot & Replay

  18. [Diagram: a contiguous log of events 1, 2, 3, 4, 5, 6, … X]

  19. [Diagram: State accumulated by processing events 1-4 of the log]

  20. [Diagram: a Snapshot capturing the State as of event 4]

  21. [Diagram: recovery from the Snapshot at event 4, then replay of events 5, 6, … X to rebuild State]
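
A minimal sketch of the recovery idea on slides 18-21, using hypothetical names (this is not the Aeron Cluster API): restore the state captured by the latest snapshot, then replay only the log events recorded after it.

    // Sketch: snapshot + replay recovery for a trivial counter service.
    final class CounterService
    {
        private long state;            // service state being rebuilt
        private long lastAppliedIndex; // index of the last log event applied

        // Restore the state captured by a snapshot taken at snapshotIndex.
        void loadSnapshot(final long snapshotState, final long snapshotIndex)
        {
            state = snapshotState;
            lastAppliedIndex = snapshotIndex;
        }

        // Replay the events recorded after the snapshot, in log order.
        void replay(final Iterable<Long> eventsAfterSnapshot)
        {
            for (final long event : eventsAfterSnapshot)
            {
                state += event; // deterministic application of each event
                lastAppliedIndex++;
            }
        }
    }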

  22. Clustered Services

  23. [Diagram: three clustered Services]

  24. [Diagram: three Services, each with its own local Log Archive]

  25. Replicated State Machines https://en.wikipedia.org/wiki/State_machine_replication

  26. Replicated State Machines. Each replicated Service: same event log, same input ordering, log replicated locally.
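
The "same state everywhere" guarantee only holds if applying the same events in the same order always yields the same result. A sketch of what that rules out, with a hypothetical interface (not the Aeron Cluster API):

    // Sketch: a deterministic replicated service.
    interface ReplicatedService
    {
        // Invoked with the same events, in the same order, on every replica.
        void onEvent(long logIndex, long timestampMs, byte[] payload);
    }

    final class EchoService implements ReplicatedService
    {
        public void onEvent(final long logIndex, final long timestampMs, final byte[] payload)
        {
            // OK: derive behavior from logIndex, timestampMs, and payload only.
            // Not OK: System.currentTimeMillis(), Random, HashMap iteration
            // order, local files - any of these can make replicas diverge.
        }
    }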

  27. Replicated State Machines. Checkpoints / Snapshots: an event in the log that "rolls up" the previous log events.

  28. When should a service “consume” (or process) a log event?

  29. [Diagram: three Services whose local Archives have diverged: the first holds events 1-2, the second events 1-6, the third events 1-7]

  30. Once processed, an Event cannot be altered. Only process an event once it is stable.

  31. Replicated State Machines. Raft Consensus: an Event must be recorded at a majority of Replicas before being consumed by any Replica. https://raft.github.io/
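
This rule turns directly into arithmetic over the replicas' append positions: sort them, and the position that a majority has matched or exceeded is the highest event that is safe to consume. An illustrative helper (not Aeron's internals):

    import java.util.Arrays;

    final class Quorum
    {
        // Highest log position recorded at a majority of replicas.
        static long commitPosition(final long[] appendPositions)
        {
            final long[] sorted = appendPositions.clone();
            Arrays.sort(sorted);
            final int majority = sorted.length / 2 + 1;
            // The majority-th highest position is held by >= majority replicas.
            return sorted[sorted.length - majority];
        }
    }

    // With the archives from slide 29 at positions {2, 6, 7}:
    // commitPosition(new long[] { 2, 6, 7 }) == 6, so events 1-6 may be
    // consumed, while event 7 must wait for another replica to append it.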

  32. [Diagram: the same three Archives (events 1-2, 1-6, 1-7); two of the three, a majority, hold events 1-6]

  33. [Diagram: the same three Archives; events 1-6 are stable at a majority and may be consumed, event 7 may not]

  34. Raft Strong Leader: an elected member of the Cluster; orders input; disseminates consensus.

  35. [Diagram: three cluster members, each with a Service, a Consensus module, and an Archive]

  36. Replicated State Machines. Raft is: an algorithm with formal verification.

  37. Replicated State Machines. Raft is not: a specification, nor a complete system.

  38. The Real World: more than Raft. The Leader timestamps events; async, not RPC-based; timers.

  39. [Diagram: a Client connected to the cluster; one member is marked as Leader; each member has a Service, a Consensus module, and an Archive]

  40. Benefits

  41. Benefits: Determinism. The log is immutable; the log can be played, stopped, & replayed; each event is timestamped; Services are restarted from snapshot & log.

  42. What Can You Do?

  43. Distributed Key/Value Store, Distributed Timers, Distributed Locks (see the sketch below)
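
A distributed key/value store, for instance, falls out of the model almost directly: the "store" is a deterministic state machine applying committed log events. A sketch with hypothetical event handlers:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: a key/value store as a replicated state machine. In a real
    // cluster these handlers would be fed from the committed log.
    final class ReplicatedKvStore
    {
        private final Map<String, String> map = new HashMap<>();

        // Every replica applies the same committed events in the same
        // order, so every replica converges on identical contents.
        void onPut(final String key, final String value)
        {
            map.put(key, value);
        }

        void onRemove(final String key)
        {
            map.remove(key);
        }

        // Reads are served locally - no consensus round required.
        String get(final String key)
        {
            return map.get(key);
        }
    }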

  44. Finance: Matching Engines, Order Management, Market Surveillance, P&L, Risk, …

  45. Beyond: Venue Ticketing / Reservations, Auctions. Hint: a contended database is a good indicator.

  46. Why Aeron?

  47. Aeron: efficient, reliable UDP unicast, UDP multicast, and IPC message transport. Java, C/C++, C#, Go. https://github.com/real-logic/Aeron
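
For orientation, basic Aeron publish/subscribe usage looks roughly like this (the channel and stream id are arbitrary examples, a media driver is assumed to be running, and error handling is elided):

    import io.aeron.Aeron;
    import io.aeron.Publication;
    import io.aeron.Subscription;
    import org.agrona.concurrent.UnsafeBuffer;
    import java.nio.ByteBuffer;

    public final class AeronEchoSketch
    {
        public static void main(final String[] args)
        {
            final String channel = "aeron:udp?endpoint=localhost:40123";
            final int streamId = 10;

            try (Aeron aeron = Aeron.connect();
                Publication pub = aeron.addPublication(channel, streamId);
                Subscription sub = aeron.addSubscription(channel, streamId))
            {
                final UnsafeBuffer msg = new UnsafeBuffer(ByteBuffer.allocateDirect(64));
                final int length = msg.putStringAscii(0, "hello");

                // offer() is non-blocking; a negative result signals back
                // pressure or a not-yet-connected publication, so retry.
                while (pub.offer(msg, 0, length) < 0)
                {
                    Thread.yield();
                }

                // Poll until the fragment arrives, then print the payload.
                while (sub.poll((buffer, offset, len, header) ->
                    System.out.println(buffer.getStringAscii(offset)), 10) == 0)
                {
                    Thread.yield();
                }
            }
        }
    }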

  48. Aeron. And a little bit more…: very fast Archival & Replay. https://github.com/real-logic/Aeron

  49. The “Efficient” bit…

  50. All communications: Aeron publications & subscriptions, Aeron archival & replay, Aeron shared counters.

  51. Consensus based on Aeron stream position

  52. Batching: critical to efficient operation; optimizes pipelined throughput.
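
The batching shows up in the duty cycle: draining many fragments per poll amortizes the fixed per-call costs (wakeups, cache misses, position updates) across a whole batch. A sketch of such a loop (names and limits are illustrative):

    import io.aeron.Subscription;
    import io.aeron.logbuffer.FragmentHandler;
    import org.agrona.concurrent.IdleStrategy;
    import java.util.concurrent.atomic.AtomicBoolean;

    final class DutyCycle
    {
        // Drain up to a batch of fragments per iteration; idle only when
        // no work was done, so a busy stream is never slowed down.
        static void run(
            final Subscription subscription,
            final FragmentHandler handler,
            final IdleStrategy idleStrategy,
            final AtomicBoolean running)
        {
            while (running.get())
            {
                final int fragments = subscription.poll(handler, 10);
                idleStrategy.idle(fragments);
            }
        }
    }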

  53. Flow Control: critical to correct operation.

  54. Design for Efficiency?

  55. Cache Hit/Miss Ratios, Branch Prediction, Allocation Rates, Garbage Collection, Inlining Optimizations

  56. Not… Yet…

  57. Ownership, Dependency, & Coupling; Complexity; Layers of Abstraction (ain’t free); Resource Management

  58. Closer… But… Still. Not. Yet.

  59. "AmdahlsLaw" by Daniels220 at English Wikipedia - Own work based on: File:AmdahlsLaw.png. Licensed under CC BY-SA 3.0 via Wikimedia Commons

  60. Universal Scalability Law [Chart: Speedup (0-20) vs. Processors (1-1024), comparing Amdahl’s Law with the USL]
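
For reference, the two curves compared in the chart: Amdahl’s Law bounds speedup by the parallel fraction p, while Gunther’s Universal Scalability Law adds a coherence penalty that can make speedup decline, with alpha the contention (serialization) penalty and beta the coherence (crosstalk) penalty:

    S_{Amdahl}(N) = \frac{1}{(1 - p) + p / N}

    C_{USL}(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}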

  61. Breakdown of Interactions: Fundamental Sequential Operations

  62. Ingress: Message, Sequence, Disseminate [Diagram: the Client sends an Ingress message to the Leader, which sequences it and disseminates Log Events to Followers X and Y over the Log channel (multicast or serial unicast); Member Status flows between members]

  63. Followers Append [Diagram: Followers X and Y append the Log Event to their local logs and report their Append Positions back to the Leader over the Member Status channel]

  64. Commit Message [Diagram: the Leader disseminates the Commit Position to Followers X and Y over the Log channel (multicast or serial unicast)]

  65. Breakdown of Interactions: Pipeline-able Operations & Batching

  66. Stream Positions [Diagram: Leader at Log Event @8192 and Archive Position @8096; Follower at Append Position @6912 and Archive Position @7168; Commit Position @4096; Log is multicast or serial unicast] Storing locally is asynchronous to Position processing by Consensus and to Log processing by the Service. Batching applies to the Log, Appends, and Commits.
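
One way to read these positions: each is an independent, monotonically advancing counter over the same stream, and safety needs only an ordering invariant between them, not lock-step progress. A sketch using the leader’s values from this slide (hypothetical names):

    // Sketch: per-member stream positions. Each advances independently,
    // which is what lets archiving and service processing run
    // asynchronously; only the ordering invariant must hold.
    final class MemberPositions
    {
        long logPosition = 8192;     // highest log event seen
        long archivePosition = 8096; // stored locally, may lag the log
        long commitPosition = 4096;  // stable at a majority
        long servicePosition = 4096; // consumed by the service

        boolean invariantHolds()
        {
            return servicePosition <= commitPosition
                && commitPosition <= logPosition
                && archivePosition <= logPosition;
        }
    }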

  67. Doesn’t this Complicate Recovery?

  68. Recovery Positions
      Follower 1: Archive Position @8096, Commit Position @4096, Service Position @4096
      Follower 2: Archive Position @7584, Commit Position @4064, Service Position @4064
      Follower 3: Archive Position @7168, Commit Position @4032, Service Position @3776
      A synchronous system doesn’t make this complexity go away! An election still needs to assert the state of the cluster & locally catch up.

  69. Limitations of Efficiency Throughput & Latency

  70. Round-Trip Time (RTT): Constant Delay [Diagram: the Client sends Ingress to the Leader (Service A); the Leader disseminates the Log Event to Followers (Service Ox) over the Log channel (multicast or serial unicast); Followers return Append Positions; the Leader disseminates the Commit Position]
      Client to Service A: 0.5 RTT
      Client to Service Ox: 1 RTT
      Client to Service A (on Commit): 1.5 RTT
      Client to Service Ox (on Commit): 2 RTT

  71. Limits from Constant Delay
      Shared Memory (RTT < 100ns): Client to Service A: 50ns; Client to Service Ox: 100ns; on Commit: 150ns and 200ns
      Rack, Kernel Bypass (RTT < 10us): Client to Service A: 5us; Client to Service Ox: 10us; on Commit: 15us and 20us
      Data Center (RTT < 100us): Client to Service A: 50us; Client to Service Ox: 100us; on Commit: 150us and 200us
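
All of these figures are the same arithmetic applied to slide 70: each network hop costs half an RTT, so Client to Leader is 0.5 RTT, dissemination to a Follower adds another 0.5 RTT, and the Append Position and Commit hops add 0.5 RTT each, giving 1.5 RTT and 2 RTT for commit-time delivery. At data-center RTT < 100us that is 0.5 x 100us = 50us up to 2 x 100us = 200us; the latency floor of any deployment is fixed by its RTT alone.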

  72. Measured Latency at Throughput [Chart: RTT (us) vs. percentile (Min, 0.50, 0.90, 0.99, 0.9999, 0.999999, Max) at 100K and 200K msgs/sec]
      Setup: Intel Xeon Gold 5118 (2.30GHz, 12 cores), 32GB DDR4 2400MHz ECC RAM, Intel Optane SSD 900P Series 480GB, SolarFlare X2522-PLUS 10GbE NIC; all servers connected to an Arista 7150S; CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64, tuned for a low-latency workload.
      Workload: single client session, bursts of 20x 200B messages, 3-node cluster, Service(s) echo(es) the payload back. Courtesy of Mark Price.

  73. Takeaways: efficiency is part of design; the power of a timestamped, replicated log; Replicated State Machines.

  74. Current Status: Aeron Archiving - fully supported; Aeron Clustering - pre-release. Sponsored by https://weareadaptive.com/

  75. Questions? StoneTor Aeron: https://github.com/real-logic/Aeron Twitter: @toddlmontgomery Thank You!
