COPS
Scalable Causal Consistency for Wide-Area Storage
A presentation by
Maxymilian Śmiech
But here are those who did the work: Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen
– Give a consistent view of data
– Be always available
– Perform well in case of partitioning
early search engines – synchronization was not critical
modern social networks – inconsistent data leads to frustration
We can't have all of C, A, and P. Instead, we trade strong consistency for low latency and high scalability: CAP → ALPS (Availability, low Latency, Partition tolerance, Scalability). But we don't want to give up consistency entirely – a single view of data helps keep application software simple.
– Causal consistency – Convergent conflict handling
No need to handle them at application level
1. Save uploaded photo (and its metadata)
2. Add photo (its reference) to an album

1. Read album data (list of photo references)
2. For each photo reference, put a link on the page
At the application level, we must check whether the data store has all photos referenced from the album; if not, we don't render broken links on the page. Otherwise Bob may get a "404" error for pictures not yet present in "his" cluster. Each time the album is viewed, we check which photos are available. We shouldn't have to think that way!
Now each cluster checks whether it has received the photo before exposing the album update, so Bob never sees that dangling photo reference. The old album contents will be returned even if an updated version is available. The cluster delays updates received from a remote cluster until all of their dependencies are satisfied. Result: when the data store returns the updated album, the application can be sure that the new photo is also available.
A convergent conflict handler resolves conflicts between two values assigned to the same key. It must be associative and commutative, so the outcome is independent of conflict resolution order. It can execute some merge logic, or just "add" both possibilities, store them as a new value, and let the application handle the conflict later.
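As a minimal sketch (not from the paper), set-union over photo references is one such handler: it is associative and commutative, so every replica converges to the same value regardless of the order in which conflicts are resolved. The `merge_album` name is illustrative.

```python
def merge_album(a, b):
    # Handler for two conflicting album values (lists of photo refs).
    # Set-union is associative and commutative, so resolution order
    # does not affect the final, converged value.
    return sorted(set(a) | set(b))

# Resolution order does not matter:
left = merge_album(merge_album(["p1"], ["p2"]), ["p3"])
right = merge_album(["p1"], merge_album(["p2"], ["p3"]))
assert left == right == ["p1", "p2", "p3"]
```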
– COPS
Read/write single pieces of data. Reads always return values according to causal+.
– COPS-GT
Get transaction – ability to retrieve consistent set of values.
– Application (front-end) servers talk to the local storage cluster
– Each cluster resides entirely in a single datacenter
– This provides low latency for local operations and no partitioning within a cluster (the cluster is linearizable)
– Data is constantly exchanged with other datacenters without blocking current operations
– Data always respects causal+ properties...
– ...even if any of the datacenters fails, dependencies are preserved
value = get(key)
put(key, value)
The client library is used by the client (application server) when performing operations on the data store. All communication between threads happens through COPS (so dependencies can be tracked).
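A hypothetical sketch of how an application server might use the client library: every read and write goes through the library so the context can track causal dependencies. The class and its internals are illustrative assumptions, not the paper's code.

```python
class CopsClient:
    """Illustrative stand-in for the COPS client library."""

    def __init__(self):
        self.store = {}    # stand-in for the local cluster: key -> (val, ver)
        self.context = []  # <key, version> dependencies seen by this thread

    def get(self, key):
        val, ver = self.store.get(key, (None, 0))
        self.context.append((key, ver))  # reads add dependencies
        return val

    def put(self, key, val):
        ver = self.store.get(key, (None, 0))[1] + 1
        self.store[key] = (val, ver)
        # In COPS, the context is reset to just this put (its nearest dependency)
        self.context = [(key, ver)]
        return ver

client = CopsClient()
client.put("photo:1", "jpeg-bytes")
client.put("album:alice", ["photo:1"])  # causally after the photo put
album = client.get("album:alice")
assert album == ["photo:1"]
```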
If a and b happen in the same execution thread and a happens before b, then a → b. If a is put(k,v) and b is a get(k) that returns the value put by a, then a → b. Transitivity: a → b and b → c implies a → c.
If a ↛ b and b ↛ a, then a and b are concurrent.
They are unrelated and can be replicated independently.
But if such concurrent a is put(k,v) and b is put(k,w), then a and b are in conflict, which must be resolved.
There should be more arrows, but they are implied by those shown above
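The three rules above can be illustrated with a tiny (hypothetical) sketch: record the direct happens-before edges, take their transitive closure, and call two operations concurrent when there is no path either way.

```python
# Direct happens-before edges, e.g. from thread order and gets-from:
edges = {
    ("a", "b"),  # a before b in the same thread
    ("b", "c"),  # c = get(k) returned the value written by b
}

def happens_before(x, y, edges):
    # Transitive closure via depth-first search from x.
    stack, seen = [x], set()
    while stack:
        n = stack.pop()
        for (u, v) in edges:
            if u == n and v not in seen:
                if v == y:
                    return True
                seen.add(v)
                stack.append(v)
    return False

assert happens_before("a", "c", edges)  # transitivity: a -> c
# "d" has no path to or from "a", so they are concurrent:
assert not happens_before("a", "d", edges) and not happens_before("d", "a", edges)
```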
Node – part of a linearizable key-value store, with additional extensions to support replication in a causal+ way. Application context – tracks dependencies within an execution thread.
[In COPS]
[In COPS-GT]
COPS (application server) handles multiple user sessions
Each put(key, val) is assigned a version using a Lamport timestamp: it respects causal dependencies (a larger timestamp means a later update) and is sent with replication messages. On message arrival, the receiver sets its counter to the maximum of the received value and its own counter, plus one.
default convergent conflict handling (Lamport timestamps give a global order on updates to the same key, so just let the last write win)
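The clock rule and the default handler can be sketched as follows (illustrative, not the paper's code): versions are <time, node_id> pairs, a receiver advances its counter with max-plus-one, and the globally later version wins.

```python
def tick(clock):
    # Local event (e.g. a put): advance the counter.
    return clock + 1

def on_receive(local_clock, received_time):
    # Lamport rule: take the maximum of both counters, plus one.
    return max(local_clock, received_time) + 1

def last_writer_wins(ver_a, ver_b):
    # Versions compare lexicographically as (time, node_id); the node id
    # breaks ties, giving a total order on updates to the same key.
    return max(ver_a, ver_b)

assert on_receive(3, 7) == 8
assert last_writer_wins((5, 1), (5, 2)) == (5, 2)
```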
library and number of checks done by nodes
the version increases with causally-related puts to the key
context (the application saw the value, so its next actions may be based on it)
dependencies for key
COPS: afterwards, it clears the current context and adds a single <key, ver> entry for that put
This is possible in COPS because of the transitivity of dependencies – only the nearest dependencies are needed, and this put is nearer than anything before it
COPS-GT cannot remove anything because it must be able to support get transactions
ver = null: the primary node is responsible for assigning ver and returning it to the client library. In the local cluster, all dependencies are already satisfied.
Primary node asynchronously issues the same put_after to remote primary nodes, but including previously assigned ver
It is called by the remote node for each of the nearest dependencies, to determine whether it is satisfied in the receiver's cluster. Remember that each key is assigned to a single node – that node will not return from the above call until it has written the required dependency. That dependency will be asynchronously replicated between the node and its corresponding node in the sender's cluster.
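A rough sketch of this replication path: the receiving node blocks each incoming put_after until dep_check succeeds for every nearest dependency. The API names follow the slides; the storage and blocking mechanics are illustrative assumptions.

```python
import threading

class Node:
    def __init__(self):
        self.versions = {}  # key -> highest committed version
        self.cv = threading.Condition()

    def dep_check(self, key, ver):
        # Block until this cluster has written <key, ver> (or newer).
        with self.cv:
            while self.versions.get(key, 0) < ver:
                self.cv.wait()
        return True

    def commit(self, key, ver):
        with self.cv:
            self.versions[key] = max(self.versions.get(key, 0), ver)
            self.cv.notify_all()

    def put_after(self, key, val, ver, deps):
        # A replicated put becomes visible only after every nearest
        # dependency is satisfied in the receiving cluster.
        for dep_key, dep_ver in deps:
            self.dep_check(dep_key, dep_ver)
        self.commit(key, ver)

node = Node()
node.commit("photo:1", 1)  # the photo has already been replicated
node.put_after("album", ["photo:1"], 2, [("photo:1", 1)])
assert node.versions["album"] == 2
```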
for the key.
The latest version is always returned (and stored internally).
<key, ver> will be added
Default behavior is to get latest version, but older versions can be retrieved, so get_trans will work properly.
<key, ver, deps> will be added
1. Get the permissions of the album
2. If "public", get the album and show it
Wrong: there is a race condition – after (1), Alice could add naked photos and change the permissions to "private".
1. Get the album
2. Get the permissions; if "public", show the album
Wrong: race again – after (1), Alice could remove the naked photos and change the permissions to "public".
multiple keys
replication (no single serialization point)
Dependencies are to ensure proper order of visible updates in remote cluster, but don't limit the order of replication messages.
Instead of get, COPS-GT has get_trans
# @param keys    list of keys
# @param ctx_id  context id
# @return values list of values
function get_trans(keys, ctx_id):
    # Get keys in parallel (first round)
    for k in keys
        results[k] = get_by_version(k, LATEST)

    # Calculate causally correct versions (ccv)
    for k in keys
        ccv[k] = max(ccv[k], results[k].vers)
        for dep in results[k].deps
            if dep.key in keys
                ccv[dep.key] = max(ccv[dep.key], dep.vers)

    # Get needed ccvs in parallel (second round)
    for k in keys
        if ccv[k] > results[k].vers
            results[k] = get_by_version(k, ccv[k])

    update_context(results, ctx_id)  # Update the metadata stored in the context
    return extract_values(results)   # Return only the values to the client
1. Get the latest versions of all keys
2. Get specific versions of some keys
The second round happens if, during the first round, some to-be-read keys were concurrently updated and depend on newer versions of other keys in the read set. Without this structure, the number of rounds could be infinite.
If we read A (so it's available) and it depends on B, then B must also be available (because of causality)
When a key is updated while a get_trans is running, its old version must be kept so the transaction can still read it. After the transaction, old version(s) may be deleted. Old versions only need to survive for a bounded time (on the order of seconds), so versions older than that can be deleted. In case of timeout, the client library restarts the operation – the restarted get_trans will read the new versions of the keys.
– put:get ratio captures the average relation between the number of put and get operations
– variance is the chance that different clients access the same keys (a larger value means more interaction between clients)
The above values directly impact the size of the dependency lists that each node must keep and process when doing get/put operations. The size of those lists affects performance.
simulating single-node-per-cluster log exchange method (which is causal but not scalable)
Note to legend (put:get)
COPS and COPS-GT offer similar throughput when delay between operations is long enough
We already know that dependency lists can be garbage-collected after replication to all datacenters; until then, they must be stored (and replicated) with each version.
Note to legend (variance)
COPS and COPS-GT offer similar throughput when mostly gets are issued (read-heavy)
When put:get rises, dependency list size increases – each get inherits new dependencies on values recently put by other clients. When put:get lowers, fewer values are put between gets, so their dependencies have time to get fully replicated and can be excluded from the lists. That is why list size decreases.
In the single-node-per-cluster setting, COPS is as good as the log-exchange solution. COPS's strength is that it scales well (almost linearly) with the number of nodes per cluster. COPS-GT offers almost the same throughput as COPS under all expected real-world workloads (the default parameters are rather artificial).
They provide ALPS, but not causality
Limited to single-node clusters, so are just ALP
– Distributed DBs require two-phase, wide-area locks. They are C, but are not A, L, P, or S (more or less)