Scaling State Machine Replication
Fernando Pedone
University of Lugano (USI), Switzerland
State machine replication
• Fundamental approach to fault tolerance
✦ Google Spanner
✦ Apache ZooKeeper
✦ Windows Azure Storage
✦ MySQL Group Replication
✦ Galera Cluster, …
State machine replication is intuitive & simple
• Replication transparency
✦ For clients
✦ For application developers
• Simple execution model (see the sketch below)
✦ Replicas order all commands
✦ Replicas execute commands deterministically and in the same order
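As an illustration, here is a minimal sketch of that execution model in Python. All names are hypothetical; the ordering layer, which establishes the total order of commands, is assumed and not shown.

```python
class Replica:
    """Minimal state machine replica: applies commands deterministically,
    in the total order established by the (assumed) ordering layer."""

    def __init__(self):
        self.state = {}  # replicated application state

    def on_deliver(self, command):
        # Every replica delivers the same commands in the same order,
        # so deterministic execution keeps all replicas identical.
        op, key, value = command
        if op == "write":
            self.state[key] = value
            return "ok"
        if op == "read":
            return self.state.get(key)
```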
Configurable fault tolerance but bounded performance
• Performance is bounded by what one replica can do
✦ Every replica needs to execute every command
✦ More replicas: same (if not worse) performance
[Figure: throughput vs. number of servers — throughput stays flat as servers are added]
How to scale state machine replication?
Scaling performance with partitioning
• Partitioning (aka sharding) application state (see the sketch below)
[Figure: state split across partitions Px and Py — throughput now grows with the number of servers]
• Scalable performance (for single-partition commands)
• Problem #1: How to order commands in a partitioned system?
• Problem #2: How to execute commands in a partitioned system?
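A minimal sketch of one possible object-to-partition mapping. Hash-based placement is an assumption here for illustration; the slides do not fix a particular scheme.

```python
import hashlib

NUM_PARTITIONS = 2  # e.g., Px and Py

def partition_of(key):
    # Any deterministic object-to-partition mapping works;
    # hashing is just one common choice (assumed here).
    return hashlib.md5(key.encode()).digest()[0] % NUM_PARTITIONS

def partitions_of(keys):
    # A command is single-partition iff all its objects
    # map to the same partition.
    return {partition_of(k) for k in keys}
```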
Ordering commands in a partitioned system
• Atomic multicast (toy sketch below)
✦ Commands addressed (multicast) to one or more partitions
✦ Commands ordered within and across partitions
✦ If S delivers C before C’, then no S’ delivers C’ before C
[Figure: Scalable SMR — C(x) multicast to partition Px, C(y) to Py, C(x,y) to both; atomic multicast implemented over Multi-Paxos on the network]
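The ordering property can be made concrete with a toy sketch. The single global sequencer below trivially satisfies it, but it is an assumption for illustration only; real implementations (e.g., built over Multi-Paxos per partition with cross-partition ordering) avoid this serialization bottleneck.

```python
import itertools

class AtomicMulticast:
    """Toy atomic multicast: one global sequencer totally orders all
    commands, so if some server delivers C before C', no server
    delivers C' before C. Real systems order only across the
    partitions a command actually addresses."""

    def __init__(self, partitions):
        self.partitions = partitions   # partition id -> list of replicas
        self.slot = itertools.count()  # global total order

    def multicast(self, command, dest_ids):
        next(self.slot)                # assign the next global slot
        result = None
        for pid in sorted(dest_ids):   # deterministic delivery order
            for replica in self.partitions[pid]:
                result = replica.on_deliver(command)
        return result
```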
Executing multi-partition commands
[Figure: Partition X replicates x, Partition Y replicates y; command C(x,y) : { x := y } spans both]
• Solution #1: Static partitioning of data
• Solution #2: Dynamic partitioning of data
Solution 1: Static partitioning of data
• Execution model
✦ Client queries location oracle to determine partitions
✦ Client multicasts command to involved partitions
✦ Partitions exchange and temporarily store objects needed to execute multi-partition commands
✦ Commands executed by all involved partitions
• Location oracle
✦ Simple implementation thanks to static scheme
How to execute multi-partition commands?
[Figure: for C(x,y): x := y, Partition X and Partition Y exchange x and y, keeping the remote objects as cached entries so both can execute the command]
Static scheme, step-by-step (server side sketched below)
• Client: query the oracle, multicast the command to the involved partitions, receive the result
• Server: deliver the command; if all objects are local, execute it directly
• Otherwise: send the needed objects (or a signal) to the remote partitions, wait for their objects/signal, then execute
• Server: send the result back to the client
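A minimal sketch of the server side of this flow, assuming a hypothetical messaging layer (`net.send`/`net.recv`) and commands that declare the objects they touch; the real exchange protocol is more involved.

```python
class StaticPartitionServer:
    """Static scheme: a replica executes a delivered command, first
    exchanging the needed objects with the other involved partitions."""

    def __init__(self, pid, store, net):
        self.pid = pid      # this partition's id
        self.store = store  # objects statically assigned to this partition
        self.net = net      # hypothetical messaging layer

    def on_deliver(self, command, involved):
        remote = involved - {self.pid}
        local = {k: self.store[k] for k in command.keys if k in self.store}
        for other in remote:              # ship locally owned inputs
            self.net.send(other, command.id, local)
        cached = dict(local)              # temporarily cache remote objects
        for other in remote:
            cached.update(self.net.recv(other, command.id))
        result = command.execute(cached)  # all involved partitions execute
        for k in self.store:              # apply writes to local objects only
            if k in cached:
                self.store[k] = cached[k]
        return result
```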
Solution 2: Dynamic partitioning of data
• Execution model (key idea)
✦ Turn every command single-partition
✦ If a command involves multiple partitions, move objects to a single partition before executing the command
• Location oracle
✦ Oracle implemented as a “special partition”
✦ Move operations involve oracle, source, and destination partitions
Dynamic scheme, step-by-step (server side sketched below)
• Client: query the oracle, multicast the command to one partition, receive the result; if result = retry, start over
• Server: deliver the command; if the objects span more than one partition, first move them to a single partition
• Server: if all objects are local, execute the command; otherwise result = retry
• Server: send the result back to the client
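A minimal sketch of the server side under the dynamic scheme, with a hypothetical `oracle.request_move` interface and command type; the move protocol itself is elided.

```python
class DynamicPartitionServer:
    """Dynamic scheme: every command executes at a single partition;
    if some of its objects live elsewhere, moves are triggered and
    the client is told to retry."""

    def __init__(self, pid, store, oracle):
        self.pid = pid
        self.store = store
        self.oracle = oracle  # the oracle is itself a replicated partition

    def on_deliver(self, command):
        missing = [k for k in command.keys if k not in self.store]
        if missing:
            # Objects span partitions (or the client's cached location
            # was stale): pull them toward this partition, then retry.
            for k in missing:
                self.oracle.request_move(k, dest=self.pid)
            return "retry"
        return command.execute(self.store)
```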
Termination and load balance
• Ensuring termination of commands
✦ After retrying n times, the command is multicast to all partitions
✦ Executed as a multi-partition command
• Ensuring load balance among partitions
✦ Target partition of a multi-partition command chosen randomly
Oracle: high availability and performance
• Oracle implemented as a partition
✦ For fault tolerance
• Clients cache oracle entries (client side sketched below)
✦ For performance
✦ Real oracle needed at first access and when objects change location
✦ Client retries command if cached location is stale
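The client side ties the previous two slides together: cached oracle entries, retries on stale locations, a randomly chosen target for load balance, and the all-partition fallback after n retries that guarantees termination. A minimal sketch, with hypothetical interfaces and an assumed value of n:

```python
import random

class Client:
    MAX_RETRIES = 3  # the 'n' of the termination rule (value assumed)

    def __init__(self, oracle, amcast, all_partitions):
        self.oracle = oracle
        self.amcast = amcast
        self.all_partitions = all_partitions
        self.cache = {}  # object key -> cached partition id

    def locate(self, key):
        # Real oracle contacted only at first access (or after a move).
        if key not in self.cache:
            self.cache[key] = self.oracle.lookup(key)
        return self.cache[key]

    def submit(self, command):
        for _ in range(self.MAX_RETRIES):
            # Random target among involved partitions balances load.
            target = random.choice([self.locate(k) for k in command.keys])
            result = self.amcast.multicast(command, {target})
            if result != "retry":
                return result
            for k in command.keys:  # drop stale entries before retrying
                self.cache.pop(k, None)
        # Termination: fall back to a multi-partition execution.
        return self.amcast.multicast(command, set(self.all_partitions))
```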
Dynamically (re-)partitioning the state
• Decentralized strategy
✦ Client chooses one partition among involved partitions
✦ Each move involves oracle and concerned partitions
✦ No single entity has complete system knowledge
✦ Good performance with strong locality, but…
👎 …slow convergence
👎 Poor performance with weak locality
[Figure: objects migrating between partitions P1 and P2]
Dynamically (re-)partitioning the state
• Centralized strategy (core step sketched below)
✦ Oracle builds graph of objects and relations (commands)
✦ Oracle partitions the O-R graph (METIS) and requests move operations to place all of a command’s objects in one partition
✦ Near-optimum partitioning (both strong and weak locality)
✦ Fast convergence
👎 Oracle knows location of, and relations among, objects
👎 Oracle solves a hard problem
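A minimal sketch of the centralized strategy’s core step. Building the object-relation graph is straightforward; the `partitioner` argument stands in for METIS (named on the slide), i.e., any min-edge-cut graph partitioner — its interface here is an assumption.

```python
from collections import defaultdict

def build_or_graph(commands):
    # Vertices are objects; edge weights count how often two
    # objects appear together in a command.
    weights = defaultdict(int)
    for keys in commands:
        keys = sorted(keys)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                weights[(keys[i], keys[j])] += 1
    return weights

def plan_moves(weights, placement, partitioner):
    # 'partitioner' (METIS in the real system) returns a new
    # object -> partition mapping minimizing cross-partition edges;
    # the moves reconcile the current placement with it.
    target = partitioner(weights)
    return [(obj, placement[obj], target[obj])
            for obj in target if placement.get(obj) != target[obj]]
```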
Social network application (similar to Twitter)
• GetTimeline (sketched below)
✦ Single-object command => always involves one partition
• Post
✦ Multi-object command => may involve multiple partitions
✦ Strong locality
   • 0% edge cut: social graph can be perfectly partitioned
✦ Weak locality
   • 1% and 5% edge cut, after partitioning the social graph
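For concreteness, the two commands could look as follows. The data layout and names are assumptions (timelines stored as per-user objects), not the benchmark’s actual code.

```python
def get_timeline(store, user):
    # Single-object command: one object, hence always one partition.
    return store[f"timeline:{user}"]

def post(store, author, followers, msg):
    # Multi-object command: touches the author's and all followers'
    # timelines, which may be spread over several partitions; the
    # social graph's edge cut controls how often that happens.
    for u in [author] + list(followers):
        store[f"timeline:{u}"].append(msg)
```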
GetTimelines only (single-partition commands)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8) at 0% edge cut, comparing classic SMR, static (SSMR), dynamic decentralized (DSSMR), dynamic centralized (DSSMRv2), and optimized static (SSMRMetis)]
• All schemes scale (by design)
Posts only, strong locality (0% edge cut)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), same five schemes]
• Dynamic schemes and optimized static scale, but static does not
Posts only, weak locality (1% edge cut)
[Figure: throughput (kcps) vs. number of partitions (1, 2, 4, 8), same five schemes]
• Only optimized static and centralized dynamic schemes scale
Conclusions
• Scaling State Machine Replication
✦ Possible, but locality is fundamental
   • OSs and DBs have known this for years
✦ Replication and partitioning transparency
• The future ahead
✦ Decentralized schemes with the quality of centralized schemes
✦ Expand scope of applications (e.g., data structures)
✦ “The inherent limits of scalable state machine replication”
More details: http://www.inf.usi.ch/faculty/pedone/scalesmr.html
THANK YOU!!!
Joint work with… Long Hoang Le, Enrique Fynn, Eduardo Bezerra, Robbert van Renesse