Scaling State Machine Replication
Fernando Pedone
University of Lugano (USI), Switzerland
State machine replication
• Fundamental approach to fault tolerance
✦ Google Spanner
✦ Apache ZooKeeper
✦ Windows Azure Storage
✦ MySQL Group Replication
✦ Galera Cluster, …
State machine replication is intuitive & simple
• Replication transparency
✦ For clients
✦ For application developers
• Simple execution model (see the sketch below)
✦ Replicas order all commands
✦ Replicas execute commands deterministically and in the same order
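As an illustration, here is a minimal sketch of that execution model in Python. All names are hypothetical; the ordering layer, which establishes the total order of commands, is assumed and not shown.

```python
class Replica:
    """Minimal state machine replica: applies commands deterministically,
    in the total order established by the (assumed) ordering layer."""

    def __init__(self):
        self.state = {}  # replicated application state

    def on_deliver(self, command):
        # Every replica delivers the same commands in the same order,
        # so deterministic execution keeps all replicas identical.
        op, key, value = command
        if op == "write":
            self.state[key] = value
            return "ok"
        if op == "read":
            return self.state.get(key)
```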
Configurable fault tolerance but bounded performance
• Performance is bounded by what one replica can do
✦ Every replica needs to execute every command
✦ More replicas: same (if not worse) performance
[Figure: throughput vs. number of servers — throughput stays flat as servers are added]
How to scale state machine replication?
Scaling performance with partitioning
• Partitioning (aka sharding) application state (see the sketch below)
[Figure: state split across partitions Px and Py — throughput now grows with the number of servers]
• Scalable performance (for single-partition commands)
• Problem #1: How to order commands in a partitioned system?
• Problem #2: How to execute commands in a partitioned system?
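A minimal sketch of one possible object-to-partition mapping. Hash-based placement is an assumption here for illustration; the slides do not fix a particular scheme.

```python
import hashlib

NUM_PARTITIONS = 2  # e.g., Px and Py

def partition_of(key):
    # Any deterministic object-to-partition mapping works;
    # hashing is just one common choice (assumed here).
    return hashlib.md5(key.encode()).digest()[0] % NUM_PARTITIONS

def partitions_of(keys):
    # A command is single-partition iff all its objects
    # map to the same partition.
    return {partition_of(k) for k in keys}
```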
Ordering commands in a partitioned system
• Atomic multicast (toy sketch below)
✦ Commands addressed (multicast) to one or more partitions
✦ Commands ordered within and across partitions
✦ If S delivers C before C’, then no S’ delivers C’ before C
[Figure: Scalable SMR — C(x) multicast to partition Px, C(y) to Py, C(x,y) to both; atomic multicast implemented over Multi-Paxos on the network]
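The ordering property can be made concrete with a toy sketch. The single global sequencer below trivially satisfies it, but it is an assumption for illustration only; real implementations (e.g., built over Multi-Paxos per partition with cross-partition ordering) avoid this serialization bottleneck.

```python
import itertools

class AtomicMulticast:
    """Toy atomic multicast: one global sequencer totally orders all
    commands, so if some server delivers C before C', no server
    delivers C' before C. Real systems order only across the
    partitions a command actually addresses."""

    def __init__(self, partitions):
        self.partitions = partitions   # partition id -> list of replicas
        self.slot = itertools.count()  # global total order

    def multicast(self, command, dest_ids):
        next(self.slot)                # assign the next global slot
        result = None
        for pid in sorted(dest_ids):   # deterministic delivery order
            for replica in self.partitions[pid]:
                result = replica.on_deliver(command)
        return result
```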
Executing multi-partition commands
[Figure: Partition X replicates x, Partition Y replicates y; command C(x,y) : { x := y } spans both]
• Solution #1: Static partitioning of data
• Solution #2: Dynamic partitioning of data
Solution 1: Static partitioning of data
• Execution model
✦ Client queries location oracle to determine partitions
✦ Client multicasts command to involved partitions
✦ Partitions exchange and temporarily store objects needed to execute multi-partition commands
✦ Commands executed by all involved partitions
• Location oracle
✦ Simple implementation thanks to static scheme
How to execute multi-partition commands?
[Figure: for C(x,y): x := y, Partition X and Partition Y exchange x and y, keeping the remote objects as cached entries so both can execute the command]
Static scheme, step-by-step (server side sketched below)
• Client: query the oracle, multicast the command to the involved partitions, receive the result
• Server: deliver the command; if all objects are local, execute it directly
• Otherwise: send the needed objects (or a signal) to the remote partitions, wait for their objects/signal, then execute
• Server: send the result back to the client
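A minimal sketch of the server side of this flow, assuming a hypothetical messaging layer (`net.send`/`net.recv`) and commands that declare the objects they touch; the real exchange protocol is more involved.

```python
class StaticPartitionServer:
    """Static scheme: a replica executes a delivered command, first
    exchanging the needed objects with the other involved partitions."""

    def __init__(self, pid, store, net):
        self.pid = pid      # this partition's id
        self.store = store  # objects statically assigned to this partition
        self.net = net      # hypothetical messaging layer

    def on_deliver(self, command, involved):
        remote = involved - {self.pid}
        local = {k: self.store[k] for k in command.keys if k in self.store}
        for other in remote:              # ship locally owned inputs
            self.net.send(other, command.id, local)
        cached = dict(local)              # temporarily cache remote objects
        for other in remote:
            cached.update(self.net.recv(other, command.id))
        result = command.execute(cached)  # all involved partitions execute
        for k in self.store:              # apply writes to local objects only
            if k in cached:
                self.store[k] = cached[k]
        return result
```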
Solution 2: Dynamic partitioning of data
• Execution model (key idea)
✦ Turn every command single-partition
✦ If a command involves multiple partitions, move objects to a single partition before executing the command
• Location oracle
✦ Oracle implemented as a “special partition”
✦ Move operations involve oracle, source, and destination partitions
Dynamic scheme, step-by-step (server side sketched below)
• Client: query the oracle, multicast the command to one partition, receive the result; if result = retry, start over
• Server: deliver the command; if the objects span more than one partition, first move them to a single partition
• Server: if all objects are local, execute the command; otherwise result = retry
• Server: send the result back to the client
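A minimal sketch of the server side under the dynamic scheme, with a hypothetical `oracle.request_move` interface and command type; the move protocol itself is elided.

```python
class DynamicPartitionServer:
    """Dynamic scheme: every command executes at a single partition;
    if some of its objects live elsewhere, moves are triggered and
    the client is told to retry."""

    def __init__(self, pid, store, oracle):
        self.pid = pid
        self.store = store
        self.oracle = oracle  # the oracle is itself a replicated partition

    def on_deliver(self, command):
        missing = [k for k in command.keys if k not in self.store]
        if missing:
            # Objects span partitions (or the client's cached location
            # was stale): pull them toward this partition, then retry.
            for k in missing:
                self.oracle.request_move(k, dest=self.pid)
            return "retry"
        return command.execute(self.store)
```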
Termination and load balance
• Ensuring termination of commands
✦ After retrying n times, the command is multicast to all partitions
✦ Executed as a multi-partition command
• Ensuring load balance among partitions
✦ Target partition of a multi-partition command chosen randomly
Oracle: high availability and performance
• Oracle implemented as a partition
✦ For fault tolerance
• Clients cache oracle entries (client side sketched below)
✦ For performance
✦ Real oracle needed at first access and when objects change location
✦ Client retries command if cached location is stale
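The client side ties the previous two slides together: cached oracle entries, retries on stale locations, a randomly chosen target for load balance, and the all-partition fallback after n retries that guarantees termination. A minimal sketch, with hypothetical interfaces and an assumed value of n:

```python
import random

class Client:
    MAX_RETRIES = 3  # the 'n' of the termination rule (value assumed)

    def __init__(self, oracle, amcast, all_partitions):
        self.oracle = oracle
        self.amcast = amcast
        self.all_partitions = all_partitions
        self.cache = {}  # object key -> cached partition id

    def locate(self, key):
        # Real oracle contacted only at first access (or after a move).
        if key not in self.cache:
            self.cache[key] = self.oracle.lookup(key)
        return self.cache[key]

    def submit(self, command):
        for _ in range(self.MAX_RETRIES):
            # Random target among involved partitions balances load.
            target = random.choice([self.locate(k) for k in command.keys])
            result = self.amcast.multicast(command, {target})
            if result != "retry":
                return result
            for k in command.keys:  # drop stale entries before retrying
                self.cache.pop(k, None)
        # Termination: fall back to a multi-partition execution.
        return self.amcast.multicast(command, set(self.all_partitions))
```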
Dynamically (re-)partitioning the state
• Decentralized strategy
✦ Client chooses one partition among involved partitions
✦ Each move involves oracle and concerned partitions
✦ No single entity has complete system knowledge
✦ Good performance with strong locality, but…
👎 …slow convergence
👎 Poor performance with weak locality
[Figure: objects migrating between partitions P1 and P2]
Dynamically (re-)partitioning the state
• Centralized strategy (core step sketched below)
✦ Oracle builds graph of objects and relations (commands)
✦ Oracle partitions the O-R graph (METIS) and requests move operations to place all of a command’s objects in one partition
✦ Near-optimum partitioning (both strong and weak locality)
✦ Fast convergence
👎 Oracle knows location of, and relations among, objects
👎 Oracle solves a hard problem
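A minimal sketch of the centralized strategy’s core step. Building the object-relation graph is straightforward; the `partitioner` argument stands in for METIS (named on the slide), i.e., any min-edge-cut graph partitioner — its interface here is an assumption.

```python
from collections import defaultdict

def build_or_graph(commands):
    # Vertices are objects; edge weights count how often two
    # objects appear together in a command.
    weights = defaultdict(int)
    for keys in commands:
        keys = sorted(keys)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                weights[(keys[i], keys[j])] += 1
    return weights

def plan_moves(weights, placement, partitioner):
    # 'partitioner' (METIS in the real system) returns a new
    # object -> partition mapping minimizing cross-partition edges;
    # the moves reconcile the current placement with it.
    target = partitioner(weights)
    return [(obj, placement[obj], target[obj])
            for obj in target if placement.get(obj) != target[obj]]
```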
Social network application (similar to Twitter)
• GetTimeline (sketched below)
✦ Single-object command => always involves one partition
• Post
✦ Multi-object command => may involve multiple partitions
✦ Strong locality
   • 0% edge cut: social graph can be perfectly partitioned
✦ Weak locality
   • 1% and 5% edge cut, after partitioning the social graph
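For concreteness, the two commands could look as follows. The data layout and names are assumptions (timelines stored as per-user objects), not the benchmark’s actual code.

```python
def get_timeline(store, user):
    # Single-object command: one object, hence always one partition.
    return store[f"timeline:{user}"]

def post(store, author, followers, msg):
    # Multi-object command: touches the author's and all followers'
    # timelines, which may be spread over several partitions; the
    # social graph's edge cut controls how often that happens.
    for u in [author] + list(followers):
        store[f"timeline:{u}"].append(msg)
```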
GetTimelines only (single-partition commands)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8) at 0% edge cut, comparing classic SMR, static (SSMR), dynamic decentralized (DSSMR), dynamic centralized (DSSMRv2), and optimized static (SSMRMetis)]
• All schemes scale (by design)
Posts only, strong locality (0% edge cut)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), same five schemes]
• Dynamic schemes and optimized static scale, but static does not
Posts only, weak locality (1% edge cut)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), same five schemes]
• Only optimized static and centralized dynamic schemes scale
Conclusions
• Scaling State Machine Replication
✦ Possible, but locality is fundamental
   • OSs and DBs have known this for years
✦ Replication and partitioning transparency
• The future ahead
✦ Decentralized schemes with the quality of centralized schemes
✦ Expand scope of applications (e.g., data structures)
✦ “The inherent limits of scalable state machine replication”
More details: http://www.inf.usi.ch/faculty/pedone/scalesmr.html
THANK YOU!!!
Joint work with… Long Hoang Le, Enrique Fynn, Eduardo Bezerra, Robbert van Renesse