
Anti-Entropy using CRDTs on HA Datastores, Sailesh Mukil (PowerPoint presentation)



  1. Anti-Entropy using CRDTs on HA Datastores. Sailesh Mukil, Senior Software Engineer, Netflix

  2. Timeline: Cassandra (2011), Multi-region (2013), Dynomite adoption (2016)

  3. Dynomite: makes non-distributed datastores distributed

  4. Dynomite overview [diagram: datastore, three portions of 33% each]

  5. Dynomite overview [diagram: Replica 1, Replica 2, Replica 3]

  6. [diagram: a client and Replica 1, Replica 2, Replica 3]

  7. [diagram: a client and Replica 1, Replica 2, Replica 3]

  8. [diagram: a client and Replica 1, Replica 2, Replica 3]

  9. Dynomite overview ● Global replication ● Pluggable datastores (Redis primarily) ● High availability ● Shared nothing ● Multiple quorum levels ● Auto-sharding ● Supports datastore API ● Linear scale

  10. Dynomite footprint @ Netflix ● ~1000 customer-facing nodes ● ~1M OPS/s ● Largest cluster holds ~6 TB

  11. The problem: entropy in the system

  12. Entropy in the system [diagram: client issues SET K 123 to replicas R-1, R-2 and R-3]

  13. Entropy in the system [diagram: SET K 123 is applied; R-1, R-2 and R-3 each hold K: 123]

  14. Entropy in the system [diagram: client receives OK; R-1, R-2 and R-3 each hold K: 123]

  15. Entropy in the system [diagram: client issues SET K 456; R-1, R-2 and R-3 each hold K: 123]

  16. Entropy in the system [diagram: SET K 456 is applied on R-3 only; R-1 and R-2 still hold K: 123]

  17. Entropy in the system [diagram: client receives ERR; R-1 K: 123, R-2 K: 123, R-3 K: 456]

  18. Entropy in the system [diagram: client issues SET K 789; R-1 K: 123, R-2 K: 123, R-3 K: 456]

  19. Entropy in the system [diagram: SET K 789 is applied on R-2 only; R-1 K: 123, R-3 K: 456]

  20. Entropy in the system [diagram: client receives ERR; R-1 K: 123, R-2 K: 789, R-3 K: 456]

  21. [diagram: two GET K requests; R-1 K: 123, R-2 K: 789, R-3 K: 456]

  22. [diagram: the two GET K requests return 789 and 456]

  23. GET K (w/ quorum) [diagram: R-1 K: 123, R-2 K: 789, R-3 K: 456]

  24. GET K (w/ quorum) [diagram: R-1 K: 123, R-2 K: 789, R-3 K: 456]

  25. GET K (w/ quorum) [diagram: R-1 compares the values 123 and 456]

  26. ERR: QUORUM FAILED [diagram: R-1 K: 123, R-2 K: 789, R-3 K: 456]

  27. ERR: QUORUM FAILED [diagram: the compared values 123 and 456 disagree]

  28. Replicas will go out of sync
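The failed quorum read in slides 23 to 27 can be reproduced with a minimal sketch. The `quorum_read` helper and the dictionaries standing in for replicas are hypothetical names for illustration, not Dynomite's API, and the sketch assumes that a quorum means a strict majority of the contacted replicas returning the same value.

```python
from collections import Counter


class QuorumFailed(Exception):
    """Raised when no value is returned by a majority of replicas."""


def quorum_read(replicas, key):
    """Return the value that a strict majority of replicas agree on.

    `replicas` is a list of dicts, each standing in for one replica's
    local store (a hypothetical stand-in, not a real datastore client).
    """
    values = [replica.get(key) for replica in replicas]
    value, votes = Counter(values).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise QuorumFailed(f"no majority for {key!r}: {values}")
    return value


# Replica state after the two failed writes in the slides.
r1, r2, r3 = {"K": 123}, {"K": 789}, {"K": 456}

try:
    quorum_read([r1, r2, r3], "K")
    outcome = "ok"
except QuorumFailed:
    outcome = "quorum failed"
```

With three different values no majority exists, so the read fails even though every replica is reachable.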

  29. Timeline: Cassandra (2011), Multi-region (2013), Dynomite adoption (2016), Dynomite w/ CRDTs (2019)

  30. Achieving anti-entropy (traditionally). Last Writer Wins: ● Uses physical timestamps ● Clock skew. Vector Clocks: ● Shows causal relationships ● But not for concurrent writes
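The vector clock limitation on slide 30 can be seen in a small sketch. The `compare` function and the clock values below are illustrative assumptions (the talk shows no vector clock code); the clocks are mapped onto the SET K 456 / SET K 789 scenario from the earlier slides only as an example.

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks given as {replica_id: counter} dicts.

    Returns "before", "after", "equal" or "concurrent".
    """
    replicas = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(r, 0) <= vc_b.get(r, 0) for r in replicas)
    b_le_a = all(vc_b.get(r, 0) <= vc_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = {"R-1": 1}                  # SET K 123, seen by every replica
on_r3 = {"R-1": 1, "R-3": 1}       # SET K 456, applied only at R-3
on_r2 = {"R-1": 1, "R-2": 1}       # SET K 789, applied only at R-2

causal = compare(base, on_r3)      # base happened before on_r3
conflict = compare(on_r2, on_r3)   # neither write saw the other
```

The clocks order `base` before `on_r3`, but for the two later writes they only report that the writes are concurrent; they do not say which value should win.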

  31. The solution: conflict-free replicated data types

  32. Conflict-free replicated data types: A CRDT is a data structure which can be replicated across the network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.

  33. Associative: grouping of operations does not matter, (X + Y) + Z = X + (Y + Z). Commutative: order of operations does not matter, X + Y = Y + X. Idempotent: duplication of operations does not matter, X + X = X.
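As a concrete instance of the three properties, the sketch below reads the slide's `+` as a merge function and checks each law. The choice of an element-wise maximum over per-replica counts is an assumption made for illustration; the slide does not name a specific merge.

```python
def merge(a, b):
    """Element-wise maximum of two {replica_id: count} dicts."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}


x = {"R-1": 2}
y = {"R-1": 1, "R-2": 3}
z = {"R-3": 5}

associative = merge(merge(x, y), z) == merge(x, merge(y, z))
commutative = merge(x, y) == merge(y, x)
idempotent = merge(x, x) == x
```

Because all three hold, replicas can exchange state in any grouping, in any order, and more than once, and still arrive at the same result.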

  34. Types of operations on CRDTs. Update: ● Updates local state. Merge: ● Converges replica states

  35. Introduction to CRDTs: When we write, we update. When we repair, we merge. Read repair = merge on read path.

  36. Introduction to CRDTs: CRDTs provide strong eventual consistency

  37. Naive distributed counter [diagram: INCR CTR; R-1, R-2 and R-3 each hold CTR: 1]

  38. Naive distributed counter [diagram: DECR CTR reaches R-2 and INCR CTR reaches R-3; R-1 holds CTR: 1, R-2 goes from CTR: 1 to CTR: 0, R-3 goes from CTR: 1 to CTR: 2]

  39. Naive distributed counter: Repair based on timestamp? Latest value is 2, which is incorrect [diagram: R-1 CTR: 1, R-2 CTR: 0, R-3 CTR: 2]
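A short sketch of why the timestamp-based repair on slide 39 gives the wrong answer. The timestamps 10, 20 and 30 and the `lww_repair` helper are made up for illustration; only the counter values come from the slides.

```python
# Each replica stores (value, timestamp of its last local write).
r1 = (1, 10)   # saw only the original INCR
r2 = (0, 20)   # applied DECR at t=20 (assumed timestamp)
r3 = (2, 30)   # applied INCR at t=30 (assumed timestamp)


def lww_repair(*states):
    """Pick the state with the newest timestamp (last writer wins)."""
    return max(states, key=lambda state: state[1])


repaired_value, _ = lww_repair(r1, r2, r3)   # 2: the DECR is lost
intended_value = 1 - 1 + 1                   # start at 1, one DECR, one INCR
```

Last writer wins keeps a single replica's value and throws the other operations away, which is acceptable for overwrites but not for a counter where every operation must be counted.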

  40. CRDT: PNCounters. Each replica maintains 2 “local” counters ● Positive counter: tracks increments ● Negative counter: tracks decrements. Final counter value: (sum of all PCounters) - (sum of all NCounters)
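The definition above can be sketched as a small state-based PN-Counter. This is a minimal illustration, not Dynomite's implementation: the class name, the dictionary layout and the use of a per-replica maximum as the merge are assumptions, and replication of the first INCR is modelled by calling `merge` directly.

```python
class PNCounter:
    """State-based PN-Counter: one P slot and one N slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}   # replica_id -> increments originating at that replica
        self.n = {}   # replica_id -> decrements originating at that replica

    def incr(self):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + 1

    def decr(self):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + 1

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        """Take the per-replica maximum of both counter maps."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)


r1, r2, r3 = PNCounter("R-1"), PNCounter("R-2"), PNCounter("R-3")

r1.incr()        # INCR CTR at R-1 ...
r2.merge(r1)     # ... replicated to R-2
r3.merge(r1)     # ... and to R-3
r2.decr()        # DECR CTR reaches only R-2
r3.incr()        # INCR CTR reaches only R-3
before = (r1.value(), r2.value(), r3.value())

for a in (r1, r2, r3):          # repair: merge every other replica's state
    for b in (r1, r2, r3):
        if a is not b:
            a.merge(b)
after = (r1.value(), r2.value(), r3.value())
```

Before the repair the replicas disagree (1, 0 and 2); after merging, each one computes 2 increments minus 1 decrement and reports 1.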

  41. CRDT: PNCounter [diagram: INCR CTR; per-replica P/N counter tables on R-1, R-2 and R-3, cell values not recoverable]

  42. CRDT: PNCounter [diagram: DECR CTR and INCR CTR reach different replicas; P/N counter tables not recoverable]

  43. CRDT: PNCounter [diagram: local values are CTR = 1 on R-1, CTR = 0 on R-2 and CTR = 2 on R-3]

  44. CRDT: PNCounter [diagram: GET CTR; P/N counter tables not recoverable]

  45. CRDT: PNCounter [diagram: GET CTR triggers repair (merge) between R-1, R-2 and R-3]

  46. CRDT: PNCounter [diagram: after the merge, CTR = 1 on R-1, R-2 and R-3]

  47. CRDT: LWW-Element Set. Used for registers, hashmaps and sorted sets. Used to maintain key metadata ● Add set: latest update timestamps for keys ● Remove set: timestamps at which keys were removed. Registers can take arbitrary values ● Hence we still require LWW to resolve conflicts
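A minimal sketch of the add set / remove set bookkeeping described above, applied to plain registers. The `LWWStore` class, its method names and the integer timestamps are hypothetical; the sketch also assumes timestamps are unique, since it does not define a tie-break for an update and a delete carrying the same timestamp.

```python
class LWWStore:
    """Key/value store with LWW-element-set metadata per key."""

    def __init__(self):
        self.data = {}     # key -> value
        self.add_ts = {}   # add set: key -> latest update timestamp
        self.rem_ts = {}   # remove set: key -> timestamp of latest removal

    def _latest(self, key):
        return max(self.add_ts.get(key, -1), self.rem_ts.get(key, -1))

    def set(self, key, value, ts):
        if ts <= self._latest(key):
            return False               # older than what we have: discard
        self.add_ts[key] = ts
        self.data[key] = value
        return True

    def delete(self, key, ts):
        if ts <= self._latest(key):
            return False               # older than what we have: discard
        self.rem_ts[key] = ts
        self.data.pop(key, None)
        return True

    def get(self, key):
        return self.data.get(key)

    def merge_key(self, other, key):
        """Repair one key by replaying the other replica's recorded ops."""
        if key in other.rem_ts:
            self.delete(key, other.rem_ts[key])
        if key in other.data:
            self.set(key, other.data[key], other.add_ts[key])


T1, T2, T3, T4 = 1, 2, 3, 4
r1, r2, r3 = LWWStore(), LWWStore(), LWWStore()

for r in (r1, r2, r3):
    r.set("K1", 123, T1)      # SET K1 123 reaches every replica
r3.set("K1", 456, T2)         # SET K1 456 reaches only R-3
r1.merge_key(r3, "K1")        # repair: t2 > t1, so 456 wins on R-1

for r in (r1, r2):
    r.set("K2", 999, T3)      # SET K2 999 reaches R-1 and R-2
r2.delete("K2", T4)           # DEL K2 reaches only R-2
r1.merge_key(r2, "K2")        # repair: the delete at t4 beats the add at t3
```

After the two repairs R-1 holds 456 for K1 and no value for K2, and a late replay of the SET at t3 would be discarded because the remove set still records t4.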

  48. LWW-Element Set [diagram: SET K1 123 (t1); R-1, R-2 and R-3 each hold K1: 123 with add-set timestamp t1 and an empty remove set]

  49. LWW-Element Set [diagram: SET K1 456 (t2); R-3 now holds K1: 456 with add-set timestamp t2, R-1 and R-2 still hold K1: 123 at t1]

  50. LWW-Element Set [diagram: SET K2 999 (t3); R-1 and R-2 hold K2: 999 with add-set timestamp t3, R-3 does not have K2]

  51. LWW-Element Set [diagram: GET K1 with repair; R-1 sees K1 = 123 (t1) and K1 = 456 (t2); t2 > t1 => 456 is the latest value]

  52. LWW-Element Set [diagram: the read returns “456”; repair brings the stale replica to K1: 456 at t2]

  53. LWW-Element Set [diagram: GET K2; one replica answers K2 = 999 (t3), another answers (nil)]

  54. LWW-Element Set [diagram: the read returns “999”; repair writes K2: 999 at t3 to the replica that lacked it]

  55. LWW-Element Set [diagram: DEL K2 (t4); a remove-set entry for K2 at t4 appears on one replica]

  56. LWW-Element Set [diagram: GET K2 returns “999” while one replica has already dropped K2]

  57. LWW-Element Set [diagram: GET K2; the replies are K2 = 999 (t3) and K2 del @t4]

  58. LWW-Element Set [diagram: repair; the read returns (nil) and DEL K2 (t4) is applied at R-1, whose remove set now holds K2 at t4]

  59. Implementation challenges (LWW-element set): Redis doesn’t maintain timestamps. Dynomite can track the timestamp of the client request.

  60. Implementation challenges (LWW-element set): We’d like Dynomite to remain stateless. Store the metadata inside Redis.

  61. Implementation challenges (LWW-element set): Operations must modify data and metadata atomically. Rewrite operations into Redis Lua scripts (guarantees atomicity).

  62. Implementation challenges (LWW-element set): Does the remove set grow forever? Delete metadata from the remove set ASAP if ALL replicas agree. A background thread cleans the rest. Maintain the remove set as a sorted set.
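One way to picture the remove-set cleanup is a list kept sorted by timestamp, which makes "drop everything older than X" a single range operation. This is a sketch under assumptions: the `RemoveSet` class is hypothetical, it lives in process memory rather than in Redis, and the age cutoff used by the background pass is a guess, since the slide does not say which rule the background thread applies.

```python
import bisect


class RemoveSet:
    """Tombstones kept sorted by timestamp, like a sorted set scored by time."""

    def __init__(self):
        self._entries = []   # sorted list of (timestamp, key)

    def add(self, key, ts):
        bisect.insort(self._entries, (ts, key))

    def drop(self, key, ts):
        """Remove one tombstone, e.g. once all replicas agree on the delete."""
        i = bisect.bisect_left(self._entries, (ts, key))
        if i < len(self._entries) and self._entries[i] == (ts, key):
            del self._entries[i]

    def purge_older_than(self, cutoff_ts):
        """Background cleanup: drop every tombstone with timestamp < cutoff_ts."""
        i = bisect.bisect_left(self._entries, (cutoff_ts,))
        purged = self._entries[:i]
        del self._entries[:i]
        return purged

    def keys(self):
        return [key for _, key in self._entries]


rs = RemoveSet()
rs.add("K2", 4)
rs.add("K7", 9)
rs.add("K5", 6)

rs.drop("K2", 4)                   # every replica has applied DEL K2
purged = rs.purge_older_than(9)    # background pass with an assumed cutoff
```

Keeping the entries ordered by timestamp means the background pass never has to scan tombstones that are newer than the cutoff.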

  63. Implementation challenges (LWW-element set): What does an example Lua script look like? Check if the update is old. Discard it if it is. Update data + metadata otherwise.
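The three steps can be sketched as follows. This is a Python simulation of the logic, not the Lua script used by Dynomite: the `set_if_newer` function, the layout of `store` and the return strings are hypothetical, and a lock stands in for the atomicity that running the steps inside a single Redis Lua script would provide.

```python
import threading

_script_lock = threading.Lock()   # stand-in for atomic script execution


def set_if_newer(store, key, value, ts):
    """Check, discard or apply one SET as a single atomic step."""
    with _script_lock:
        latest = max(store["add"].get(key, -1), store["rem"].get(key, -1))
        if ts <= latest:                # 1. check if the update is old
            return "DISCARDED"          # 2. discard it if it is
        store["data"][key] = value      # 3. otherwise update the data ...
        store["add"][key] = ts          #    ... and the metadata together
        return "OK"


store = {"data": {}, "add": {}, "rem": {}}
first = set_if_newer(store, "K1", 123, ts=1)
second = set_if_newer(store, "K1", 456, ts=2)
late = set_if_newer(store, "K1", 789, ts=1)   # arrives late, older than ts=2
```

The late write is rejected, so the stored value stays at 456 and the data never gets ahead of or behind its timestamp metadata.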

  64. Repairs occur on the read path in Dynomite. Repairs for point reads only.

  65. Background repairs (Note: ongoing work)

  66. Background repairs: Repairing on range reads is expensive. E.g.: give me all members of a set; return everything in this hashmap; return me a range from this sorted set.

  67. Background repairs: How do we target keys that need repairing? Full key walk? (like Cassandra)

  68. Background repairs: How do we target keys that need repairing? Maintain a list of recently written-to keys. Run the merge operation on them (async). But merge operations on large structures are expensive.
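The "recently written-to keys" idea could look roughly like the sketch below. The talk marks this as ongoing work, so everything here is an assumption for illustration: the `RecentWrites` class, the batch limit and the `merge_key` callback are hypothetical, and the asynchronous scheduling is left out.

```python
class RecentWrites:
    """Tracks recently written keys so a background task can repair them."""

    def __init__(self):
        self._keys = {}   # dict used as an insertion-ordered set

    def record(self, key):
        self._keys.pop(key, None)   # a re-write moves the key to the back
        self._keys[key] = True

    def repair_batch(self, merge_key, limit):
        """Run `merge_key` on up to `limit` of the oldest recorded keys."""
        batch = list(self._keys)[:limit]
        for key in batch:
            merge_key(key)
            del self._keys[key]
        return batch


recent = RecentWrites()
for key in ("K1", "K2", "K1"):
    recent.record(key)

repaired = []                                  # stand-in for a real merge
first_batch = recent.repair_batch(repaired.append, limit=1)
second_batch = recent.repair_batch(repaired.append, limit=10)
```

Bounding each batch keeps a single pass cheap, although it does not address the cost of merging one very large structure.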
