Scalable Consistency in Scatter: A Distributed Key-Value Storage System
Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, Thomas Anderson
University of Washington
Supported by NSF CNS-0963754
October 2011
Internet services depend on distributed key-value stores.
[Diagram: the consistency vs. scalability trade-off; Dynamo sits at the scalability end, Scatter aims to provide both.]
Scatter: Goals
✓ linearizable consistency semantics
✓ scalable in a wide area network
✓ high availability
✓ performance close to existing systems
Scatter: Approach
Combine ideas from:
scalable peer-to-peer systems:
✓ distributed hash tables
✓ self-organization
✓ decentralization
consistent datacenter systems:
✓ consensus
✓ replication
✓ transactions
Distributed Hash Tables: Background
Core functionality: partition the key-space and assign keys to nodes.
System structure: knowledge of system state is distributed among all nodes; links between nodes form an overlay.
System management: nodes coordinate locally to respond to churn, e.g.,
• give keys to new nodes
• take over the keys of failed nodes
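For illustration only (not from the talk): a minimal consistent-hashing-style ring in Python, showing how a DHT partitions and assigns keys to nodes. The Ring class and its methods are hypothetical, and real DHTs add routing tables, replication, and failure handling on top of this.

    import bisect
    import hashlib

    def key_hash(key: str) -> int:
        # Map an arbitrary key to a position on the ring.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    class Ring:
        """Toy DHT ring: each node owns the keys from its predecessor's
        position (exclusive) up to its own position (inclusive)."""

        def __init__(self):
            self.positions = []   # sorted node positions on the ring
            self.nodes = {}       # position -> node id

        def join(self, node_id: str):
            pos = key_hash(node_id)
            bisect.insort(self.positions, pos)
            self.nodes[pos] = node_id

        def leave(self, node_id: str):
            pos = key_hash(node_id)
            self.positions.remove(pos)
            del self.nodes[pos]

        def lookup(self, key: str) -> str:
            # The successor node of the key's position owns the key.
            pos = key_hash(key)
            i = bisect.bisect_left(self.positions, pos) % len(self.positions)
            return self.nodes[self.positions[i]]

    # Usage: assign keys to nodes and observe reassignment under churn.
    ring = Ring()
    for n in ["a", "b", "c"]:
        ring.join(n)
    print(ring.lookup("tweet:42"))
    ring.join("d")             # churn: keys near d's position move to d
    print(ring.lookup("tweet:42"))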
Distributed Hash Tables: Faults Cause Inconsistencies
Example: c joins between a and b.
JOIN updates: c.pred = a; c.succ = b; a.succ = c; b.pred = c; b.keys = (k_c, k_b]; c.keys = (k_a, k_c].
[Diagram: ring positions k_a, k_c, k_b before and after the join.]
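A minimal Python sketch of the unsynchronized join the slide describes (hypothetical code, not Scatter's or any particular DHT's implementation). Each assignment stands for a separate network message, which is why the faults on the next slide can leave the ring inconsistent.

    # Hypothetical node state for the join example.
    class Node:
        def __init__(self, name, pos):
            self.name, self.pos = name, pos
            self.pred = self.succ = None
            self.keys = None            # half-open range (lo, hi]

    def naive_join(a, b, c):
        """c joins between a and b. Each step below stands for a
        separate network message; nothing makes them atomic."""
        c.pred, c.succ = a, b           # 1. c learns its neighbors
        a.succ = c                      # 2. a points to c
        b.pred = c                      # 3. b points to c
        c.keys = (a.pos, c.pos)         # 4. c claims (k_a, k_c]
        b.keys = (c.pos, b.pos)         # 5. b gives up (k_a, k_c]
        # A lost message or a crash partway through can leave both b and c
        # claiming (k_a, k_c], or leave the range with no owner at all.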
Distributed Hash Tables: Faults Cause Inconsistencies
Example: c joins between a and b. What could go wrong?
Fault → Outcome:
• communication fault between b and c → both b and c claim ownership of (k_a, k_c]
• c fails during the operation → no node claims ownership of (k_a, k_c]
• communication fault between a and c → routes through a skip over c
Distributed Hash Tables: Weak Atomicity Causes Anomalies
DHTs use ad-hoc protocols to add and remove nodes. What happens if...
• two nodes join at the same place at the same time?
• two adjacent nodes leave at the same time?
• the predecessor leaves during a node join?
• one node mistakenly thinks another node has failed?
...
Scatter: Design Overview
How is Scatter different? It uses groups, instead of individual nodes, as building blocks.
What is a group? A set of nodes that cooperatively manage a key-range.
What does this give us?
• nodes within a group act as a single entity
• a group is much less likely to fail than an individual node
• distributed transactions for operations involving multiple groups
Scatter: Group Anatomy
‣ the group replicates all state among its members with Paxos
‣ changes to group membership are Paxos reconfigurations: include new nodes, exclude failed nodes
‣ the key-range is further partitioned among the nodes of the group for performance
‣ each node orders client operations on its keys
Example group: nodes = {a, b, c}; keys = (k_z, k_c]; values = {...}; a.keys = (k_z, k_a]; b.keys = (k_a, k_b]; c.keys = (k_b, k_c].
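A minimal sketch of the group state described on this slide, assuming hypothetical names (GroupState, primary_for) and eliding the Paxos machinery that replicates it across members.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    KeyRange = Tuple[int, int]   # half-open range (lo, hi] of ring positions

    @dataclass
    class GroupState:
        """State replicated with Paxos across all group members (Paxos elided)."""
        members: List[str]                      # e.g. ["a", "b", "c"]
        keys: KeyRange                          # range owned by the whole group
        values: Dict[int, str] = field(default_factory=dict)
        # Per-node sub-ranges: each member orders client operations
        # on its own slice of the group's key-range.
        assignments: Dict[str, KeyRange] = field(default_factory=dict)

        def primary_for(self, key: int) -> str:
            for node, (lo, hi) in self.assignments.items():
                if lo < key <= hi:
                    return node
            raise KeyError(f"key {key} not in group range {self.keys}")

    # Illustrative ring positions standing in for k_z, k_a, k_b, k_c:
    group = GroupState(
        members=["a", "b", "c"],
        keys=(0, 30),
        assignments={"a": (0, 10), "b": (10, 20), "c": (20, 30)},
    )
    assert group.primary_for(15) == "b"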
Scatter: Self-Reorganization
Some problems can't be handled within a single group:
• small groups are at risk of failing
• large groups are slow
• load imbalance across groups
Multi-group operations:
• MERGE two small groups into one
• SPLIT one large group into two
• rebalance keys and nodes between groups
These are distributed transactions coordinated locally by the groups involved.
[Diagram: group b splitting into b1 and b2; two groups merging into one.]
Example: Group Split via 2PC
[Animation: group b coordinates a two-phase commit with its adjacent groups a and c to split itself into b1 and b2.]
1. Prepare: b's leader proposes "split b?" to groups a and c; each group agrees on its vote internally and replies "ok!".
2. Commit: with all votes in, b performs a Paxos reconfiguration ("RECONFIGURE!") that replaces it with two new groups b1 and b2; the transaction is committed.
3. Notify: a and c apply "b split!" and update their adjacency to point to b1 and b2.
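A minimal sketch of the two-phase-commit split described above. All names (Group, vote, reconfigure_split, split_via_2pc) are hypothetical stand-ins for Scatter's Paxos-backed group operations, and each group's internal consensus is reduced to a stubbed vote.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class Group:
        group_id: str
        members: List[str]
        keys: Tuple[int, int]                 # half-open range (lo, hi]
        neighbors: Dict[str, str] = None      # adjacent group ids

        def vote(self, proposal) -> bool:
            # In Scatter a group would agree on its vote via Paxos, so it
            # answers 2PC as a single entity. Here: always vote yes.
            return True

        def reconfigure_split(self):
            # Paxos reconfiguration replacing this group with two halves.
            lo, hi = self.keys
            mid = (lo + hi) // 2
            half = len(self.members) // 2
            b1 = Group(self.group_id + "1", self.members[:half], (lo, mid),
                       dict(self.neighbors))
            b2 = Group(self.group_id + "2", self.members[half:], (mid, hi),
                       dict(self.neighbors))
            return b1, b2

    def split_via_2pc(a: Group, b: Group, c: Group):
        """Two-phase commit coordinated by b to split itself into b1, b2."""
        # Phase 1 (prepare): the coordinator asks every participating group.
        if not all(g.vote(("split", b.group_id)) for g in (a, b, c)):
            return None                        # any "no" vote aborts the split
        # Phase 2 (commit): b reconfigures; this is the commit point.
        b1, b2 = b.reconfigure_split()
        # Neighbors update their adjacency from b to the new groups.
        a.neighbors["succ"] = b1.group_id
        c.neighbors["pred"] = b2.group_id
        return b1, b2

    # Usage with toy groups:
    a = Group("a", ["a1", "a2", "a3"], (0, 10), {"succ": "b"})
    b = Group("b", ["b1", "b2", "b3", "b4"], (10, 30), {"pred": "a", "succ": "c"})
    c = Group("c", ["c1", "c2", "c3"], (30, 40), {"pred": "b"})
    print(split_via_2pc(a, b, c))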
Scatter
✓ linearizable consistency semantics ... group consensus, transactions
✓ scalable in a wide area network ... local operations
✓ high availability ... replication, reconfiguration
✓ performance close to existing systems ... key partitioning, optimizations
Evaluation: Overview
Questions:
1. How robust is Scatter in a high-churn peer-to-peer environment?
2. How does Scatter adapt to a dynamic workload in a datacenter environment?
Comparisons:
• P2P environment → compared against OpenDHT
• Datacenter environment → compared against ZooKeeper
Comparison: OpenDHT
Layered OpenDHT's recursive routing on top of Scatter groups.
Implemented a Twitter-like application, Chirp.
Experimental setup:
• 840 PlanetLab nodes
• injected node churn at varying rates
• Twitter traces as the workload
• tweets and the social network stored in the DHT
Comparison: OpenDHT — Consistency and Availability
[Plots: consistent fetches (%) and completed fetches (%) vs. node lifetime (seconds), Scatter vs. OpenDHT.]
Scatter has zero inconsistencies and high availability even under churn.
Comparison: OpenDHT — Latency
[Plot: fetch latency (ms) vs. node lifetime (seconds) for Scatter and OpenDHT; the two curves differ by 10-12%.]
Scalable consistency is cheap.
Comparison: Replicated ZooKeeper
ZooKeeper: a small-scale, centralized coordination service.
Replicated ZooKeeper: the global key-space statically partitioned across multiple, isolated ZooKeeper instances (Z1, Z2, Z3, ...).
Experimental setup:
• testbed: Emulab
• varied the total number of nodes
• no churn
• same Chirp workload
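For illustration only (an assumed mapping, not the experiment's code): the Replicated ZooKeeper baseline pins each key to one of N isolated ZooKeeper ensembles with a static hash, which cannot adapt to load the way Scatter's group splits and merges can.

    import hashlib

    # Hypothetical static partitioner: each key is fixed to one of N
    # isolated ZooKeeper ensembles for the lifetime of the deployment.
    ZK_ENSEMBLES = ["zk1:2181", "zk2:2181", "zk3:2181"]   # placeholder addresses

    def ensemble_for(key: str) -> str:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return ZK_ENSEMBLES[h % len(ZK_ENSEMBLES)]

    # Unlike Scatter's dynamic repartitioning, this assignment cannot
    # rebalance: a hot partition stays on the same ensemble.
    print(ensemble_for("user:42/tweets"))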
Comparison: Replicated ZooKeeper — Scalability
[Plot: throughput (1000 ops/sec) vs. total number of nodes (5 to 150), Scatter vs. Replicated ZooKeeper.]
Dynamic partitioning adapts to changes in workload.
Scatter: Summary
✓ consensus groups of nodes as fault-tolerant building blocks
✓ distributed transactions across groups to repartition the global key-space
✓ evaluation against OpenDHT and ZooKeeper shows strict consistency, linear scalability, and high availability