Transactional storage for geo-replicated systems


  1. Transactional storage for geo-replicated systems Yair Sovran ∗ Russell Power ∗ Marcos K. Aguilera † Jinyang Li ∗ ∗ New York University † Microsoft Research Silicon Valley Presentation by Wojciech Żółtak

  2. Geo-replication ● Network latencies between distant sites can be very high. ● A natural disaster can destroy an entire data center. So, we replicate the service across many sites around the world and redirect each user to the closest one.

  3. Geo-replication of storage systems ● Application logic changes rarely and is easy to replicate. ● Data in the store changes frequently and is hard to replicate, especially when supporting transactions. We focus on a key-value store with transaction support.

  4. Why transactions? Transactions remove the burden of worrying about problems like: ● race conditions, ● partial writes, ● overwrites, and therefore make development much easier.

  5. Write-write conflicts Conflicting writes to replicated sites. [Diagram: Site A and Site B concurrently update Obj X.] How do we merge the updates?

  6. Write-write conflicts Master-slave architecture: ● a single read-write master, ● replicated, read-only slaves. [Diagram: updates to Obj X go to the master and are replicated to the slaves.] The master quickly becomes a bottleneck; a better solution is needed.

  7. Goals ● Asynchronous replication across sites. ● Efficient update-anywhere for certain objects. ● Freedom from conflict-resolution logic. ● Strong isolation within each site. Current systems provide only a subset of the above properties.

  8. 1. Parallel Snapshot Isolation (PSI) Problems with SI in replicated systems: ● SI imposes a total order on the commit times of all transactions in the whole system, even when they do not conflict. ● A transaction becomes visible only after its writes have been propagated to all sites. PSI is a new isolation property that adapts SI to replicated systems.

  9. SI vs PSI, properties

SI:
● (Snapshot Read) All operations read the most recent committed version as of the time when the transaction began.
● (No Write-Write Conflicts) The write sets of each pair of committed concurrent transactions must be disjoint.

PSI:
● (Site Snapshot Read) All operations read the most recent committed version at the transaction's site as of the time when the transaction began.
● (No Write-Write Conflict) The write sets of each pair of committed somewhere-concurrent [1] transactions must be disjoint.
● (Commit Causality Across Sites) If a transaction T1 commits at a site A before a transaction T2 starts at site A, then T1 cannot commit after T2 at any site.

Note that PSI guarantees SI within a single site.
[1] T1 and T2 are somewhere-concurrent when they are concurrent [2] at two (not necessarily different) sites.
[2] T1 and T2 are concurrent if one of them has a commit timestamp between the start and commit timestamps of the other.
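To make these definitions concrete, here is a minimal Python sketch of checks for definitions [1] and [2], assuming each transaction carries per-site start and commit timestamps and a write set; all field and function names are illustrative, and somewhere-concurrent is simplified to "concurrent at at least one site".

# Each transaction is a dict with per-site "start" and "commit" timestamps
# and a "writes" set (illustrative layout, not Walter's actual structures).

def concurrent_at(t1, t2, site):
    # [2] T1 and T2 are concurrent at a site if one of them commits between
    # the start and commit timestamps of the other at that site.
    def commits_inside(a, b):
        return b["start"][site] < a["commit"][site] < b["commit"][site]
    return commits_inside(t1, t2) or commits_inside(t2, t1)

def somewhere_concurrent(t1, t2, sites):
    # [1] Simplified here to: concurrent at at least one site.
    return any(concurrent_at(t1, t2, s) for s in sites)

def psi_write_conflict(t1, t2, sites):
    # The (No Write-Write Conflict) property is violated exactly when two
    # somewhere-concurrent transactions have overlapping write sets.
    return somewhere_concurrent(t1, t2, sites) and bool(t1["writes"] & t2["writes"])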

  10. SI vs PSI, properties Example of SI

  11. SI vs PSI, properties Example of PSI (commit timestamp may differ at different sites)

  12. SI vs PSI, anomalies

  13. 2. Preferred sites We could shard writes, i.e., associate each object with a specific site and redirect all writes for that object there. This is called the primary sites mechanism. But a transaction may write objects associated with different sites, which is problematic. Instead, a slightly less restrictive property, called preferred sites, is introduced.

  14. 2. Preferred sites ● Each object is associated with a specific site (e.g., a user's data is associated with the site closest to their usual location). ● Writing an object at its preferred site is guaranteed to be conflict-free with the other sites. ● Writing an object at a site other than its preferred site is still permitted. We will see later how these properties are achieved and what benefits they provide.

  15. 3. CSet, a commutative data type ● A data type is commutative when all operations on it are commutative, i.e., we can reorder the operations and the result stays the same. ● CSet is a commutative data type that implements a set.

  16. 3. CSet, a commutative data type Implementation: CSet : Key -> Int ● the empty CSet maps every key to 0 ● CSet.contains(X) = true iff X is mapped to a positive integer ● CSet.add(X) increments the counter associated with X ● CSet.del(X) decrements the counter associated with X

  17. 3. CSet, a commutative data type ● Because concurrent operations may add and remove the same object in the same CSet, the counter may end up neither 0 nor 1. ● The application should then decrement/increment until the counter is (not) positive, as in the sketch below. ● This behaviour can be encapsulated behind an interface and made transparent to the user.
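A minimal Python sketch of the CSet described on slides 16-17 follows; the method names (including the ensure_* wrappers that encapsulate non-0/1 counters) are illustrative, not Walter's actual API.

from collections import defaultdict

class CSet:
    """Commutative set: a map from keys to integer counters."""

    def __init__(self):
        self.counts = defaultdict(int)   # empty CSet maps every key to 0

    def contains(self, x):
        return self.counts[x] > 0        # present iff counter is positive

    def add(self, x):
        self.counts[x] += 1              # commutes with concurrent add/del

    def delete(self, x):
        self.counts[x] -= 1

    # Wrappers that hide non-0/1 counters behind an interface (slide 17):
    def ensure_present(self, x):
        while self.counts[x] <= 0:
            self.add(x)

    def ensure_absent(self, x):
        while self.counts[x] > 0:
            self.delete(x)

    def merge(self, other):
        # Merging replicas is trivial: counters simply add up, in any order.
        for k, v in other.counts.items():
            self.counts[k] += v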

  18. 3. CSet, benefits ● Since CSet is a commutative data type, it can be modified at any site without introducing write-write conflicts: merging different updates is trivial. ● A set is a very useful structure for aggregating data (a user's posts, friends, basket contents, etc.). ● CSets can eliminate some situations that would otherwise require updating objects with different preferred sites within a single transaction (e.g., modifying a symmetric relation like friendship; see the example below).
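For instance, continuing the CSet sketch above, a symmetric friendship could be kept in one CSet per user; since the operations commute, each side can be updated at any site without write-write conflicts (the names are illustrative):

friends = {"alice": CSet(), "bob": CSet()}
# Both writes commute, so they need not be coordinated across the two
# users' (possibly different) preferred sites.
friends["alice"].ensure_present("bob")
friends["bob"].ensure_present("alice")
assert friends["alice"].contains("bob")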

  19. Putting all things together - Walter [Diagram: users A and B talk to Site 1 and Site 2, with a configuration service and cluster storage behind them.] ● Data is divided into containers, which are simply logical organization units. ● An object's ID contains its container's ID, so an object cannot be moved between containers. ● Every container is associated with a preferred site and with the set of sites it should be replicated to (its replica set).

  20. Putting all things together - Walter [Diagram: as on the previous slide.] ● The configuration service is a black box that tracks the active sites and, for each container, its preferred site and replica set. ● Sites cache the mapping between containers and sites. ● Cluster storage keeps the logs of all sites for safety reasons (a site's state can be restored from its stored log).
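As a rough sketch, assuming per-container metadata, the configuration service's bookkeeping might look like this; all names are hypothetical:

from dataclasses import dataclass

@dataclass
class ContainerInfo:
    preferred_site: str        # site that gets fast, conflict-free writes
    replica_set: frozenset     # sites the container is replicated to

class ConfigService:
    def __init__(self):
        self.active_sites = set()
        self.containers = {}   # container ID -> ContainerInfo

    def lookup(self, container_id):
        # Sites cache the result of this call (slide 20).
        return self.containers[container_id]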

  21. Walter, server variables at each site

  22. Walter, executing transactions ● The version number of an object is a pair <site, seqno>. ● When transaction x starts, it obtains a timestamp vector startVTS = <CommittedVTS 1 , ... , CommittedVTS n >. ● Version v = <site, seqno> is visible to startVTS iff seqno <= startVTS[site]. ● The transaction sees a snapshot containing the newest visible version of each object.
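The visibility rule translates directly into code. A small sketch with illustrative names:

def start_tx(committedVTS):
    # startVTS is a copy of the site's committed vector timestamp:
    # <CommittedVTS_1, ..., CommittedVTS_n>.
    return {"startVTS": dict(committedVTS), "updates": {}}

def visible(version, startVTS):
    # A version <site, seqno> is visible iff seqno <= startVTS[site].
    site, seqno = version
    return seqno <= startVTS.get(site, 0)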

  23. Walter, executing transactions ● Writes made in transaction x are stored in a temporary buffer x.updates. ● When reading an object, information from x.updates is merged with information from the snapshot. ● If an object is not replicated locally, it is fetched from its preferred site.
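A read thus consults the write buffer first and then the newest visible version in the snapshot; a sketch reusing visible() from above, assuming a per-key version list kept newest-first (fetching non-local objects is elided):

def tx_read(tx, store, key):
    if key in tx["updates"]:             # own uncommitted writes win
        return tx["updates"][key]
    # store: key -> [((site, seqno), value)], newest first (illustrative).
    for (site, seqno), value in store.get(key, []):
        if visible((site, seqno), tx["startVTS"]):
            return value                 # newest visible version
    return None                          # non-local objects would be fetched
                                         # from their preferred site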

  24. Walter, fast commit Used when all written objects have a local preferred site: 1. Has any written object been modified since the transaction began? If yes, ABORT. 2. Otherwise, assign a new seqno to x. 3. Wait until transaction seqno-1 is committed. 4. Commit the transaction. 5. Propagate it to the other sites.
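A toy, single-process rendering of these steps, reusing the layout from the earlier sketches; waiting for seqno-1 and asynchronous propagation are only noted in comments, and all names are illustrative:

def fast_commit(site_name, state, tx):
    # state: {"committedVTS": {...}, "store": {key: [((site, seqno), value)]},
    #         "next_seqno": int} -- illustrative, not Walter's real layout.
    # 1. Abort if any written object was modified since the tx began.
    for key in tx["updates"]:
        for (site, seqno), _ in state["store"].get(key, []):
            if seqno > tx["startVTS"].get(site, 0):
                return False                       # ABORT
    # 2. Assign a new seqno to x.
    seqno = state["next_seqno"]
    state["next_seqno"] += 1
    # 3. (Wait here until transaction seqno-1 has committed.)
    # 4. Commit: install versions, newest first, and advance committedVTS.
    for key, value in tx["updates"].items():
        state["store"].setdefault(key, []).insert(0, ((site_name, seqno), value))
    state["committedVTS"][site_name] = seqno
    # 5. (Propagate x asynchronously to the other sites.)
    return True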

  25. Walter, fast commit

  26. Walter, slow commit Used when the transaction writes at least one object with a non-local preferred site: 1. Ask the involved sites to lock the corresponding objects. 2. If not all locks are acquired, unlock the already-locked objects and ABORT. 3. Otherwise, commit x as in fast commit. 4. When x is propagated to a site holding a related lock, that lock is released.
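A toy sketch of the locking phase with a per-site lock table; preferred_site_of and do_fast_commit are passed in as hypothetical hooks (in Walter the former would come from the container configuration):

class LockTable:
    def __init__(self):
        self.owner = {}                  # key -> transaction ID

    def try_lock(self, keys, txid):
        if any(k in self.owner for k in keys):
            return False                 # refuse: another tx holds a lock
        for k in keys:
            self.owner[k] = txid
        return True

    def unlock_all(self, txid):
        # Called when the propagated transaction reaches this site.
        self.owner = {k: t for k, t in self.owner.items() if t != txid}

def slow_commit(tx, txid, lock_tables, preferred_site_of, do_fast_commit):
    # 1. Group written keys by preferred site and ask those sites for locks.
    by_site = {}
    for key in tx["updates"]:
        by_site.setdefault(preferred_site_of(key), []).append(key)
    locked = []
    for site, keys in by_site.items():
        if lock_tables[site].try_lock(keys, txid):
            locked.append(site)
        else:
            # 2. A lock was refused: unlock what we hold, then ABORT.
            for s in locked:
                lock_tables[s].unlock_all(txid)
            return False
    # 3. All locks acquired: commit as in fast commit.
    return do_fast_commit(tx)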

  27. Walter, slow commit

  28. Walter, slow commit

  29. Walter, propagation ● After commit, transactions are propagated to the other sites. ● A site that receives a transaction x sends back an ACK. ● When the transaction has been received by at least f+1 sites (for a configured f), it is marked as disaster-safe and all sites are notified. ● A site merges transaction x once all transactions in x.startVTS have been merged and x is disaster-safe. ● When x is committed at all sites, it is marked as globally-visible.
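The status bookkeeping amounts to a small state machine; a sketch assuming we track, per transaction, the set of sites that ACKed it and the set that committed it (illustrative names):

def update_status(tx, acked_sites, committed_sites, all_sites, f):
    # Disaster-safe once at least f+1 sites have received (ACKed) x.
    if len(acked_sites) >= f + 1:
        tx["disaster_safe"] = True
    # Globally visible once x has committed at all sites.
    if committed_sites >= all_sites:     # set containment
        tx["globally_visible"] = True

def can_merge(tx, merged_txs):
    # A site merges x only after every transaction covered by x.startVTS
    # has been merged and x is disaster-safe. merged_txs is a set of
    # (site, seqno) pairs already merged at this site.
    deps_merged = all((site, n) in merged_txs
                      for site, last in tx["startVTS"].items()
                      for n in range(1, last + 1))
    return deps_merged and tx.get("disaster_safe", False)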

  30. Walter, failures ● A site can be restored from the data stored in the cluster storage system. ● The system can either wait for the site to come back online or find the best replacement among the other sites and reassign the preferred sites. ● Transactions for which not all preceding transactions can be found are discarded. ● A reactivated site can be re-integrated back into the system.

  31. Walter, partial replication ● A single data center can host several Walter servers, each replicating a different data partition. ● A transaction may operate on objects that are not replicated at a given site. ● This can be used to scale the system up.

  32. Evaluation ● 4 sites in distant places (Virginia, California, Ireland, Singapore). ● Virtual servers equivalent to a 2.5 GHz Intel Xeon with 7 GB of RAM. ● Replication to all sites in the tests. ● 600 Mbps network between hosts within a site. ● 22 Mbps network between sites. ● Transactions read/write a few randomly chosen 100-byte objects.

  33. Evaluation, round trip latencies (ms)

  34. Evaluation, base performance ● Compared against Berkeley DB 11gR2. ● Both run in a master-slave architecture. ● Store populated with 50k regular objects. ● Workload reads/writes all objects in the DB.

  35. Evaluation, fast commit

  36. Evaluation, fast commit

  37. Evaluation, fast commit

  38. Evaluation, fast commit

  39. Evaluation, fast commit

  40. Evaluation, slow commit
