  1. Census: Location-Aware Membership Management for Large-Scale Distributed Systems. James Cowling, Dan R. K. Ports, Barbara Liskov, Raluca Ada Popa, Abhijeet Gaikwad* (MIT CSAIL; *École Centrale Paris)

  2. Motivation: Large-scale distributed systems are becoming more common (multiple datacenters, cloud computing, etc.). Reconfigurable distributed services adapt as nodes join, leave, or fail. A membership service that tracks changes in system membership can simplify system design.

  3. Census: a platform for building large-scale distributed applications. Two main components: a membership service and a multicast communication mechanism. Designed to work in the wide area; locality-aware and fault tolerant.

  4. Membership Service: Time is divided into sequential, fixed-duration epochs. Each epoch has a membership view: a list of nodes (ID, IP address, location, etc.). Consistency property: every node sees the same membership view for a particular epoch ➡ simplifies protocol design (e.g. partitioning storage).
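Because every node sees the identical view for an epoch, any deterministic function of the view gives the same answer everywhere. A minimal sketch of such a view in Python; the field names and the modulo partitioning rule are illustrative assumptions, not Census's actual format:

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class Node:
    node_id: int
    ip: str
    location: tuple[float, float]   # 2-D network coordinates (e.g. Vivaldi)

@dataclass(frozen=True)
class MembershipView:
    epoch: int                      # fixed-duration epoch this view covers
    nodes: tuple[Node, ...]         # identical on every member this epoch

def owner_of(view: MembershipView, key: str) -> Node:
    # A stable hash (not Python's per-process hash()) means every node
    # computes the same owner for a key: consistent views make storage
    # partitioning coordination-free.
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return view.nodes[h % len(view.nodes)]
```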

  5. Consistency & Scalability: Existing systems trade off consistency against scalability. Examples: virtual synchrony (e.g. ISIS, Spread); distributed hash tables (e.g. Chord, Pastry). Census provides consistent membership views and is designed for large-scale, wide-area systems.

  6–9. Membership Service: Basic Approach • Designate one node as leader • Nodes report membership changes to the leader • The leader aggregates the changes and multicasts them as an item • Members enter the next epoch and update their membership (a sketch of one epoch transition follows)
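A rough sketch of one epoch transition under this approach, reusing the MembershipView sketch above; collect_reports and multicast are hypothetical placeholders for the real report and dissemination paths, not Census APIs:

```python
def run_epoch(view, collect_reports, multicast):
    # 1. Nodes report joins, departures, and suspected failures.
    joins, departed_ids = collect_reports()
    # 2. The leader aggregates the changes into a single item.
    item = {"epoch": view.epoch + 1, "joins": joins, "departed": departed_ids}
    # 3. The item is multicast to all current members.
    multicast(item)
    # 4. Every member (leader included) applies the same item, so all
    #    enter the next epoch with an identical view.
    return apply_item(view, item)

def apply_item(view, item):
    survivors = [n for n in view.nodes if n.node_id not in item["departed"]]
    return MembershipView(epoch=item["epoch"],
                          nodes=tuple(survivors) + tuple(item["joins"]))
```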

  10. What are the Challenges? Delivering items efficiently and reliably ➡ multicast mechanism. Reducing load on the leader ➡ multi-region structure. Dealing with leader failure ➡ fault tolerance.

  11. Outline • Overview • Basic Approach • Multicast Mechanism • Multi-region Design • Fault Tolerance • Evaluation

  12. Multicast Mechanism: Need multicast to distribute membership updates and application data efficiently. Goals: high reliability, low latency, fair load balancing. Many multicast protocols exist; Census takes a different approach, exploiting consistent membership information for a simpler design and lower overhead.

  13. Multicast Topology: Multiple interior-disjoint trees (similar to SplitStream). Each node is interior in one tree and a leaf in the others. Membership data is distributed in full on every tree; the application's multicast data is erasure-coded. Improved reliability and load balancing vs. a single tree (a sketch of the assignment follows).
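A sketch of how nodes and fragments might map onto the trees. The 16-tree count matches the configuration in the evaluation, but the id-based coloring rule and the plain striping below (which carries no redundancy) are simplifications standing in for Census's actual assignment and erasure coding:

```python
NUM_TREES = 16   # matches the 16-tree configuration in the evaluation

def tree_color(node_id: int) -> int:
    # Each node is interior in exactly one tree (its "color") and a
    # leaf in the other NUM_TREES - 1 trees. Id-based coloring is an
    # illustrative assumption.
    return node_id % NUM_TREES

def split_into_fragments(data: bytes, k: int = NUM_TREES) -> list[bytes]:
    # Application data is coded into one fragment per tree; plain
    # striping here is a stand-in for real erasure coding. Membership
    # items are small and are sent in full down every tree instead.
    return [data[i::k] for i in range(k)]
```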

  14–16. Multicast Topology [figures: example topology showing multiple interior-disjoint trees over the same membership]

  17. Building Multicast Trees: Exploit consistent membership knowledge: the tree structure is a deterministic function of the membership ➡ allows a simple “centralized” algorithm in a distributed context. Nodes independently recompute the trees on the fly upon receiving membership updates. No protocol overhead beyond that of the membership service (even during churn!).

  18–19. Tree Building Algorithm: Background: network coordinates (e.g. Vivaldi), where d(x,y) ≈ latency(x,y)

  20. Tree Building Algorithm Assign nodes to a tree (color) based on ID

  21. Building the Red Tree Split region through center of mass, along widest axis

  22. Building the Red Tree Choose closest red node in each subregion, attach to root

  23–27. Building the Red Tree: Recursively subdivide each subregion in the same way

  28. Building the Red Tree Attach other-colored nodes to the nearest available red node
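Putting slides 20–28 together, a sketch of the red-tree construction over 2-D network coordinates; the RED constant, the initial root choice, and tie-breaking are my own assumptions:

```python
RED = 0   # illustrative color index; the real trees are numbered 0..15

def widest_axis(nodes):
    xs = [n.location[0] for n in nodes]
    ys = [n.location[1] for n in nodes]
    return 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1

def center_of_mass(nodes, axis):
    return sum(n.location[axis] for n in nodes) / len(nodes)

def dist(a, b):
    return ((a.location[0] - b.location[0]) ** 2 +
            (a.location[1] - b.location[1]) ** 2) ** 0.5

def build_red_tree(parent, region, edges):
    # Split the region through its center of mass along the widest axis,
    # attach the red node closest to the parent in each half, and recurse.
    if not region:
        return
    axis = widest_axis(region)
    cut = center_of_mass(region, axis)
    for half in ([n for n in region if n.location[axis] < cut],
                 [n for n in region if n.location[axis] >= cut]):
        reds = [n for n in half if tree_color(n.node_id) == RED]
        if not reds:
            continue
        child = min(reds, key=lambda n: dist(parent, n))
        edges.append((parent, child))
        build_red_tree(child, [n for n in half if n is not child], edges)

def attach_leaves(view, red_interior, edges):
    # Finally, attach every other-colored node as a leaf under its
    # nearest red interior node.
    for n in view.nodes:
        if tree_color(n.node_id) != RED:
            edges.append((min(red_interior, key=lambda r: dist(r, n)), n))
```

An initial call might look like build_red_tree(root, [n for n in view.nodes if n is not root], []), with the root chosen deterministically from the shared view (e.g. the red node nearest the centroid), so every node computes the identical tree.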

  29. Multicast Improvements: Reduce bandwidth overhead (avoid sending redundant data). Reduce multicast latency (choose fragments to send based on expected path length). Improve reliability during failures (reconstruct missing fragments from other trees; see the sketch below).
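A toy illustration of the reconstruction idea: with a single XOR parity fragment, any n-1 of n fragments recover the last one. Census itself uses m-of-n erasure coding (e.g. the 12/16 configuration in the evaluation), for which a real coding library would replace this sketch:

```python
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(fragments: list[bytes]) -> list[bytes]:
    # Append one XOR parity fragment (assumes equal-length fragments).
    return fragments + [reduce(_xor, fragments)]

def reconstruct_missing(received: dict[int, bytes], n: int) -> bytes:
    # With exactly one fragment lost, the XOR of everything received
    # equals the missing fragment (the XOR of all n fragments is zero).
    assert len(received) == n - 1, "toy code: handles exactly one loss"
    return reduce(_xor, received.values())
```

For example, after frags = add_parity([b"ab", b"cd", b"ef"]), dropping any one of the four fragments can be undone by XOR-ing the remaining three.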


  31. Multi-Region Structure Divide large deployments into location-based regions

  32. Multi-Region Structure One region leader per region, plus global leader

  33–34. Multi-Region Structure: Region leaders aggregate membership changes from their region

  35. Multi-Region Structure Global leader combines region reports to produce item

  36. Region Dynamics: Regions split when they grow too large. The global leader signals a split in the next item; nodes independently split the region across its widest axis using consistent membership knowledge (see the sketch below). Regions merge when one grows too small (a similar process). Nodes are assigned to the nearest region on joining.
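Because every node in the region shares the same view, each can compute the identical split locally when the leader's item signals one. A sketch reusing the widest_axis and center_of_mass helpers from the tree-building code above:

```python
def split_region(view: MembershipView):
    # Deterministic: every node derives the same two sub-regions from
    # the shared view, so no extra coordination is required.
    axis = widest_axis(view.nodes)
    cut = center_of_mass(view.nodes, axis)
    left = tuple(n for n in view.nodes if n.location[axis] < cut)
    right = tuple(n for n in view.nodes if n.location[axis] >= cut)
    return left, right
```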

  37. Multi-Region Structure Benefits – fewer messages processed by leader – fewer wide-area communications – cheaper multicast tree computation – useful abstraction for applications

  38. Partial Knowledge: Maintaining global membership knowledge is usually feasible, except in very large, dynamic, and/or bandwidth-constrained systems. Partial knowledge: each node knows only the membership of its own region, plus summary information about other regions (sketched below).
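A guess at what a per-region summary might carry under partial knowledge; these fields are illustrative assumptions (the paper defines the actual contents), chosen to support routing joins and split/merge decisions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegionSummary:
    region_id: int
    leader_ip: str                   # enough to forward joins/requests
    centroid: tuple[float, float]    # lets a joiner pick the nearest region
    size: int                        # informs split/merge decisions
```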


  40. Fault Tolerance: The global leader and region leaders can fail. Solution: replication, using standard state machine replication techniques. The replication level is chosen based on the expected number of concurrent failures. Optional: tolerating Byzantine faults.
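For concreteness, the standard replication bounds the slide alludes to, as a small helper (the function itself is mine; the 2f+1 and 3f+1 bounds are the well-known state machine replication requirements):

```python
def replicas_needed(f: int, byzantine: bool = False) -> int:
    # 2f+1 replicas tolerate f crash failures; 3f+1 replicas tolerate
    # f Byzantine failures.
    return 3 * f + 1 if byzantine else 2 * f + 1
```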


  42. Evaluation: PlanetLab deployment (614 nodes); theoretical analysis (scalability to larger systems); simulation (multicast performance).

  43. PlanetLab Deployment: 614 nodes; 30-second epochs; 1 KB/epoch multicast. [Figures: reported membership (nodes) over time (epochs), showing recovery after 10% and 25% of nodes fail; mean bandwidth per node (KB/s) over time, comparing total bandwidth usage to the multicast data size.]

  44. Bandwidth Overhead: Membership management cost analysis under a very high churn rate (average node lifetime 30 minutes). [Figures: bandwidth overhead (KB/s) vs. number of nodes (100 to 100,000) for the multiple-regions and partial-knowledge configurations, with curves for the global leader, region leaders, and regular nodes.]

  45. Multicast Reliability: Fraction of nodes successfully receiving a multicast; simulation results (10,000 nodes). [Figure: success rate vs. fraction of bad nodes, for 12/16 coding (data), 8/16 coding (data), and 16 trees (membership).]

  46. Multicast Performance: Stretch = multicast latency / ideal (unicast) latency; 1740-node measurement-derived topology. [Figure: cumulative fraction of nodes vs. stretch (0 to 8).]

  47–48. Conclusion: Census is a platform for membership management and communication in large distributed systems. It provides consistent views while scaling to extreme sizes, supporting future wide-scale distributed applications. It builds on an efficient multicast mechanism with high reliability, low latency, and low bandwidth overhead, and it exploits consistent knowledge for high performance while avoiding complexity. Thank you. Questions?
