StoneTor: Fault Tolerance at Speed. Todd L. Montgomery (@toddlmontgomery)
About me…
What type of Fault Tolerance? What is Clustering? Why Aeron? Design for Speeding Up? Efficiency
https://www.nature.com/articles/d41586-018-06610-y
https://www.forbes.com/sites/forbestechcouncil/2017/12/15/why-energy-is-a-big-and-rapidly-growing-problem-for-data-centers/#344456665a30
https://www.datacenterdynamics.com/opinions/power-consumption-data-centers-global-problem/
We seem to assume efficiency/security/quality/etc. is a “special” characteristic added … later… if at all
Fault Tolerance
[Diagram build: a Client calls a Service; scaling out to three Services and three Clients; each Service holds State]
[Diagram: alternatively, the State is pushed out to shared "Storage" behind the Services]
[Diagram: instead, keep the State inside each Service]
Fault Tolerance of State
Partition & Replication. [Diagram: the same State replicated across three Services]
Contiguous Log with Snapshot & Replay
[Diagram build: a contiguous log of events 1 2 3 4 5 6 … X; the State is the result of applying the events in order; a Snapshot rolls up events 1 through 4, and replay continues from event 5 onward]
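As a rough sketch of the snapshot-and-replay idea (hypothetical code, not Aeron's API): state can be rebuilt either by replaying the whole log, or by loading a snapshot and replaying only the events recorded after it. Both paths must yield the same state.

```java
import java.util.List;

// Sketch: state modeled as a running sum over log events. A snapshot
// captures the state at a log position; recovery loads the snapshot and
// replays only the events after that position.
public class LogReplay {
    // Replay events [fromIndex, end) on top of a snapshot value.
    static long recover(long snapshotState, List<Long> log, int fromIndex) {
        long state = snapshotState;
        for (int i = fromIndex; i < log.size(); i++) {
            state += log.get(i); // apply each event in log order
        }
        return state;
    }

    public static void main(String[] args) {
        List<Long> log = List.of(1L, 2L, 3L, 4L, 5L, 6L);
        // Full replay from the beginning of the log.
        long full = recover(0L, log, 0);
        // Snapshot after event 4 rolled up 1+2+3+4 = 10; replay events 5 and 6.
        long fromSnapshot = recover(10L, log, 4);
        System.out.println(full + " " + fromSnapshot); // both 21
    }
}
```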
Clustered Services
[Diagram: three Services, each with a local Log Archive]
Replicated State Machines https://en.wikipedia.org/wiki/State_machine_replication
Replicated State Machines Each Replicated Service Same event log Same input ordering Log replicated locally
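The determinism these bullets rely on can be sketched in a few lines (hypothetical state machine, not Aeron's API): any replica that applies the same event log in the same order arrives at the same state, so replicas need only agree on the log itself.

```java
import java.util.List;

// Sketch: a deterministic state machine over a shared event log.
// Events are strings like "deposit:100" or "withdraw:30"; the state is
// the resulting balance.
public class ReplicaDeterminism {
    static long apply(List<String> log) {
        long balance = 0;
        for (String event : log) {
            String[] parts = event.split(":");
            switch (parts[0]) {
                case "deposit" -> balance += Long.parseLong(parts[1]);
                case "withdraw" -> balance -= Long.parseLong(parts[1]);
                default -> throw new IllegalArgumentException(event);
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        List<String> log = List.of("deposit:100", "withdraw:30", "deposit:5");
        long replicaA = apply(log); // same log, same order...
        long replicaB = apply(log); // ...so the replicas agree
        System.out.println(replicaA == replicaB); // true
    }
}
```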
Replicated State Machines Checkpoints / Snapshots Event in the log “Rolling” up previous log events
When should a service “consume” (or process) a log event?
[Diagram: three Services with local Archives; their logs have reached events 2, 6, and 7 respectively]
Once processed, an Event cannot be altered. Only process an event once it is stable.
Replicated State Machines Raft Consensus Event must be recorded at majority of Replicas before being consumed by any Replica https://raft.github.io/
[Diagram: with logs at events 2, 6, and 7, events up to 6 are recorded at a majority (two of three) and may be consumed; event 7 may not]
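Under the Raft rule above, the stable (consumable) position follows from the members' append positions: sort them and take the highest position held by a majority. A minimal sketch, with assumed method names (not Aeron's API):

```java
import java.util.Arrays;

// Sketch: an event may be consumed once it is recorded at a majority of
// replicas. If each member reports the highest log position it has
// appended, the commit position is the highest position held by a
// majority: sort the positions and take the middle one (for 3 members,
// the 2nd lowest).
public class QuorumPosition {
    static long commitPosition(long[] appendPositions) {
        long[] sorted = appendPositions.clone();
        Arrays.sort(sorted);
        // Lowest position within the majority set.
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        // Members have appended up to events 2, 6, and 7, as in the diagram.
        System.out.println(commitPosition(new long[] {2, 6, 7})); // 6
    }
}
```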
Raft Strong Leader Elected member of the Cluster Orders Input Disseminates Consensus
[Diagram: each cluster member runs a Service, a Consensus module, and an Archive]
Replicated State Machines Raft is An algorithm with formal verification
Replicated State Machines Raft is not A specification Nor A complete system
The Real World: more than Raft. The Leader timestamps events. Async, not RPC-based. Timers.
[Diagram: the Client talks to the elected Leader (*); every member runs Service, Consensus, and Archive]
Benefits: Determinism. The Log is immutable; it can be played, stopped, & replayed. Each event is timestamped. Services can be restarted from snapshot & log.
What Can You Do?
Distributed Key/Value Store Distributed Timers Distributed Locks
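A distributed key/value store on top of the replicated log might look like the following sketch (hypothetical code, not an Aeron Cluster service): writes are not applied directly, they are appended to the cluster log, and every replica applies committed entries to a local map in log order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a key/value store as a replicated state machine. Each replica
// applies committed log entries, e.g. "put k v" or "remove k", to its
// local map; identical logs yield identical maps on every replica.
public class KvStateMachine {
    private final Map<String, String> map = new HashMap<>();

    // Apply one committed log entry.
    void onLogEvent(String event) {
        String[] parts = event.split(" ");
        switch (parts[0]) {
            case "put" -> map.put(parts[1], parts[2]);
            case "remove" -> map.remove(parts[1]);
            default -> throw new IllegalArgumentException(event);
        }
    }

    String get(String key) {
        return map.get(key);
    }

    public static void main(String[] args) {
        List<String> committedLog = List.of("put a 1", "put b 2", "remove a");
        KvStateMachine replica = new KvStateMachine();
        committedLog.forEach(replica::onLogEvent);
        System.out.println(replica.get("b")); // 2
        System.out.println(replica.get("a")); // null
    }
}
```

The same pattern extends to the distributed timers and locks above: a timer fire or a lock grant is just another event in the log, consumed at the same position by every replica.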
Finance Matching Engines Order Management Market Surveillance P&L, Risk, …
Beyond: Venue Ticketing / Reservations, Auctions. Hint - a contended database is a good indicator.
Why Aeron?
Aeron Efficient reliable UDP unicast, UDP multicast, and IPC message transport Java, C/C++, C#, Go https://github.com/real-logic/Aeron
Aeron And a little bit more… Very fast Archival & Replay https://github.com/real-logic/Aeron
The “Efficient” bit…
All communications use Aeron: publications & subscriptions, archival & replay, shared counters.
Consensus based on Aeron stream position
Batching: critical to efficient operation, optimizing pipelined throughput.
Flow Control: critical to correct operation.
Design for Efficiency?
Cache Hit/Miss Ratios Branch Prediction Allocation Rates Garbage Collection Inlining Optimizations
Not… Yet…
Ownership, Dependency, & Coupling Complexity Layers of Abstraction (ain’t free) Resource Management
Closer… But… Still. Not. Yet.
"AmdahlsLaw" by Daniels220 at English Wikipedia - Own work based on: File:AmdahlsLaw.png. Licensed under CC BY-SA 3.0 via Wikimedia Commons
Universal Scalability Law [Plot: Speedup vs Processors (1 to 1024), comparing Amdahl's Law and the USL]
Breakdown Interactions: Fundamental Sequential Operations
Ingress: Message, Sequence, Disseminate. The Client sends a message on the Ingress channel to the Leader, which sequences it and disseminates the Log Event to Followers X & Y on the Log channel (multicast or serial unicast).
Followers Append. Followers X & Y append the Log Event locally and send their Append Position back to the Leader on the Member Status channel.
Commit Message. The Leader disseminates the Commit Position to Followers X & Y on the Log channel (multicast or serial unicast).
Breakdown Interactions: Pipeline-able Operation & Batching
Stream Positions: Log Event @8192; Leader Archive Position @8096; Follower Append Position @6912, Archive Position @7168; Commit Position @4096. Storing locally is asynchronous to Position processing by Consensus & Log processing by the Service. Batching applies to the Log, Appends, & Commits. (Log via multicast or serial unicast; positions via Member Status.)
Doesn’t this Complicate Recovery?
Recovery Positions (three Followers): Archive Positions @8096 / @7584 / @7168; Commit Positions @4096 / @4064 / @4032; Service Positions @4096 / @4064 / @3776. A synchronous system doesn't make this complexity go away! An election still needs to assert the state of the cluster & catch up locally.
Limitations of Efficiency Throughput & Latency
Round-Trip Time (RTT), assuming constant network delay: Client to Service A (co-located with the Leader): 0.5 RTT; Client to Service Ox (on a Follower): 1 RTT; Client to Service A on Commit: 1.5 RTT; Client to Service Ox on Commit: 2 RTT. [Diagram: Ingress to Leader, Log Event to Followers (multicast or serial unicast), Append & Commit Positions via Member Status]
Limits from Constant Delay
Shared Memory (RTT <100ns): Service A 50ns; Service Ox 100ns; A on Commit 150ns; Ox on Commit 200ns
Rack with Kernel Bypass (RTT <10us): Service A 5us; Service Ox 10us; A on Commit 15us; Ox on Commit 20us
DC (RTT <100us): Service A 50us; Service Ox 100us; A on Commit 150us; Ox on Commit 200us
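These limits are straight multiples of the RTT. A small sketch of the arithmetic (hypothetical helper names, using the slide's rack numbers):

```java
// Sketch: latency lower bounds from constant network delay alone.
// A message reaches the Leader-hosted Service A after half an RTT and a
// Follower-hosted Service Ox after a full RTT; waiting for commit adds
// one more round of Append/Commit positions, i.e. another full RTT.
public class RttLimits {
    static double toServiceA(double rtt)          { return 0.5 * rtt; }
    static double toServiceOx(double rtt)         { return 1.0 * rtt; }
    static double toServiceAOnCommit(double rtt)  { return 1.5 * rtt; }
    static double toServiceOxOnCommit(double rtt) { return 2.0 * rtt; }

    public static void main(String[] args) {
        double rackRttUs = 10.0; // rack with kernel bypass, RTT < 10us
        System.out.println(toServiceA(rackRttUs));          // 5.0
        System.out.println(toServiceOx(rackRttUs));         // 10.0
        System.out.println(toServiceAOnCommit(rackRttUs));  // 15.0
        System.out.println(toServiceOxOnCommit(rackRttUs)); // 20.0
    }
}
```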
Measured Latency at Throughput [Plot: RTT (us) percentiles from Min through 0.50, 0.90, 0.99, 0.9999, 0.999999 to Max, at 100K and 200K msgs/sec]. Single client session, bursts of 20x 200B messages, 3-node cluster, Service(s) echo(es) the payload back. Hardware: Intel Xeon Gold 5118 (2.30GHz, 12 cores), 32GB DDR4 2400 MHz ECC RAM, Intel Optane SSD 900P Series 480GB, SolarFlare X2522-PLUS 10GbE NIC; all servers connected to an Arista 7150S; CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64, tuned for low-latency workload. Courtesy Mark Price.
Takeaways: Efficiency is part of design. The power of a timestamped, replicated log. Replicated State Machines.
Current Status Aeron Archiving - fully supported Aeron Clustering - pre-release Sponsored by https://weareadaptive.com/
Questions? StoneTor Aeron: https://github.com/real-logic/Aeron Twitter: @toddlmontgomery Thank You!