SLIDE 1

COPS

Scalable Causal Consistency for Wide-Area Storage

A presentation by

Maxymilian Śmiech

But here are those who did the work: Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen

SLIDE 2

What it will be about

  • 1. Problem definition
  • 2. Idea of the solution
  • 3. Implementation overview
  • 4. Performance analysis
  • 5. Previous work
  • 6. Summary
SLIDE 3

The ultimate goal

  • A distributed storage system should:

– Give a consistent view of data
– Always be available
– Perform well under network partitions

SLIDE 4

Unfortunately: CAP Theorem

  • It is not possible to have a strongly consistent (linearizable), always available system with partition tolerance

  • In practice we sacrifice consistency
SLIDE 5

Over the years...

  • Weak consistency was sufficient in the past

Early search engines – synchronization was not critical

  • Now we have distributed systems with complex dependencies

Modern social networks – inconsistent data leads to user frustration

SLIDE 6

What is worth fighting for

  • Availability
  • low Latency
  • Partition tolerance
  • high Scalability

We can't have all of C, A, and P. Instead we trade strong consistency for the ability to easily achieve low latency and high scalability: CAP → ALPS. But we don't want to give up all of C – a single view of data helps in writing simple software.

"Always on" experience

SLIDE 7

Solution – COPS data store

  • Clusters of Order-Preserving Servers
  • It has causal+ consistency:

– Causal consistency
– Convergent conflict handling

  • Causal+ is the strongest consistency model achievable under ALPS constraints

SLIDE 8

Causal consistency

  • Ensures dependencies between data are respected

No need to handle them at the application level

  • Case study: Alice adds a photo to an album:

1. Save the uploaded photo (and its metadata)
2. Add the photo (a reference to it) to the album

Now Bob opens the album page:

1. Read the album data (a list of photo references)
2. For each photo reference, put a link on the page
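Alice's two writes above can be sketched as a toy dependency tracker: every operation earlier in an execution thread causally precedes the next one. The `Context` class and its methods are illustrative, not the real COPS library.

```python
# Toy sketch: causal dependencies arising from program order
# in a single execution thread (illustrative, not the COPS API).
class Context:
    def __init__(self):
        self.ops = []          # operations in program order

    def put(self, key, value):
        # every earlier operation in this thread is a (transitive) dependency
        deps = [k for k, _ in self.ops]
        self.ops.append((key, value))
        return deps

ctx = Context()
ctx.put("photo:42", "bytes...")               # 1. save the photo -> no deps
deps = ctx.put("album:alice", ["photo:42"])   # 2. add it to the album
# deps now contains "photo:42": the album update causally depends on
# the photo, so no cluster may show the new album before the photo.
```
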

SLIDE 9

Causal vs Eventual

  • In an eventual data store, a cluster can return updates "out of order". Therefore the application server must ensure that Bob is not affected by references to photos not yet present in "his" cluster. Otherwise he may get a "404" error!
  • At the application level we must check whether the data store has all the photos referenced from the album; if not, we don't render broken links on the page. Each time the album is viewed, we check which photos are available. We shouldn't have to think that way!

SLIDE 10

Causal vs Eventual

  • We switch to causal consistency

Now each cluster checks whether it has received the photo. If not, it returns the old album info, without the dangling photo reference. The old album contents are returned even if an updated version is available. A cluster delays updates received from a remote cluster until all of their dependencies are satisfied.

Result: when the data store returns the updated album, the application can be sure that the new photo is also available.

SLIDE 11

Convergent conflict handling

  • Every cluster uses the same handler function to resolve conflicts between two values assigned to the same key.
  • We require that the handler is associative and commutative.
  • That ensures convergence to the same final value, independent of conflict resolution order.
  • The handler can be provided by the application. It can execute some processing, or just "add" both possibilities, store them as a new value, and let the application handle the conflict later.
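A minimal sketch of the default last-writer-wins handler, assuming versions are (Lamport timestamp, node id) pairs as described later in the talk. `lww_handler` is an illustrative name; the point is that the function is commutative and associative, so every cluster converges to the same winner.

```python
# Default convergent conflict handler: last-writer-wins over
# (Lamport timestamp, node_id) versions. Illustrative sketch.
def lww_handler(a, b):
    """a and b are (version, value) pairs, version = (counter, node_id).
    Returns the pair with the larger version. Tuple comparison gives a
    total order, so the result is independent of resolution order."""
    return a if a[0] > b[0] else b
```

Because the handler is deterministic over a total order, resolving `{a, b, c}` in any grouping or sequence yields the same final value.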

SLIDE 12

Design details

  • Two versions:

– COPS

Reads/writes single pieces of data. Reads always return values consistent with causal+.

– COPS-GT

Get transactions – the ability to retrieve a consistent set of values.

  • They differ in stored metadata – a single system must consist of same-type clusters.

SLIDE 13

Assumptions

  • A small number of big datacenters
  • Each datacenter contains application (front-end) servers talking to a local storage cluster
  • Each cluster keeps a copy of all data and is contained entirely in a single datacenter
  • Datacenters are good enough to provide low latency of local operations and resistance to partitioning (each cluster is linearizable)

SLIDE 14

Expectations

  • COPS requires powerful datacenters, so what does it give in return? Asynchronous replication in the background:

– Data is constantly exchanged with other datacenters without blocking current operations
– Data always respects causal+ properties...

...even if any of the datacenters fails, dependencies are preserved

SLIDE 15

COPS (abstract) interface

  • Nothing more than a simple key-value store:

value = get(key)
put(key, value)

  • Execution thread – a stateful "session" used by a client (application server) when performing operations on the data store
  • All communication between threads happens through COPS (so dependencies can be tracked)

SLIDE 16

Causality relation

  • If a and b happen in a single execution thread and a happens before b, then a → b
  • If a is put(k,v) and b is get(k) which returns the value put by a, then a → b
  • a → b and b → c implies a → c
  • If a → b and both are puts, we say that b depends on a
  • If a ↛ b and b ↛ a, then a and b are concurrent. They are unrelated and can be replicated independently.

But if such a is put(k,v) and b is put(k,w), then a and b are in conflict. It must be resolved.

SLIDE 17

Causality relation: example

There should be more arrows, but they are implied by those shown above

SLIDE 18

Architecture

Node – part of a linearizable key-value store, with additional extensions to support replication in a causal+ way.
Application context – tracks dependencies in an execution thread.

SLIDE 19

Dividing the keyspace

  • Each cluster has a full copy of the key-value set

A cluster can use consistent hashing or other methods of dividing the keyspace between its nodes

  • A cluster can use chain replication for fault tolerance. For each key there is a single primary node per cluster. Only the primary nodes of corresponding keys exchange messages between clusters.
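As a sketch of the partitioning idea, here is a minimal consistent-hash ring mapping keys to primary nodes. This is illustrative only: COPS can use consistent hashing or any other partitioning scheme, and `HashRing`, its virtual-node count, and the MD5 choice are all assumptions of this example.

```python
# Minimal consistent-hashing sketch: map each key to a primary node.
# Illustrative; not the actual COPS partitioning code.
import hashlib
from bisect import bisect

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Several virtual points per node spread keys evenly and mean
        # only ~1/N of keys move when a node joins or leaves.
        self.ring = sorted(
            (self._hash(f"{n}-{i}"), n) for n in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def primary(self, key):
        """Primary node for key: first ring point clockwise of its hash."""
        h = self._hash(key)
        idx = bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["n1", "n2", "n3"])
node = ring.primary("album:alice")   # deterministic for a fixed node set
```
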

SLIDE 20

Library interface

  • ctx_id = createContext()
  • bool = deleteContext(ctx_id)
  • bool = put(key, value, ctx_id)
  • value = get(key, ctx_id)

[In COPS]

  • values = get_trans(keys, ctx_id)

[In COPS-GT]

  • ctx_id is used to track a specific context when a single client of COPS (an application server) handles multiple user sessions

SLIDE 21

Lamport timestamp

  • Used to assign a version to <key, value> after each put(key, val). It respects causal dependencies (a larger timestamp means a later update).
  • Basically: the counter is incremented before each local update and is sent with replication messages. The receiver sets its counter to one plus the maximum of the received value and its own counter.
  • Combined with a unique node identifier, it allows implementing the default convergent conflict handler (we get a global order on updates to the same key, so we just let the last writer win).
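The counter rules above can be sketched as a minimal Lamport clock. The class and method names are illustrative, not the COPS implementation.

```python
# Minimal Lamport clock sketch (illustrative, not the COPS code).
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id   # unique node id, used as a tie-breaker
        self.counter = 0

    def tick(self):
        """Advance before a local update; return the new version,
        a (counter, node_id) pair with a global total order."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        """On a replication message: maximum-plus-one merge rule."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```

Ordering the (counter, node_id) pairs lexicographically gives the global order on updates to a key that the default last-writer-wins handler relies on.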

SLIDE 22

Nearest dependencies

  • Used to limit the size of the metadata kept by the client library and the number of checks done by nodes

  • COPS-GT must keep all dependencies
SLIDE 23

Dependencies

  • The context keeps <key, version, [deps]> entries

version increases with causally-related puts to key

  • val = get(key) adds <key, version, [deps]> to the context (the application saw val, so its next actions may be based on it)
  • put(key, val) uses the current context as the set of dependencies for key

COPS: afterwards, it clears the current context and adds a single <key, ver> entry for that put. This is possible in COPS because of the transitivity of dependencies – only the nearest dependencies are needed, and this put is nearer than anything before it.

COPS-GT cannot remove anything, because it must be able to support get transactions.
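The context bookkeeping described above (COPS mode, not COPS-GT) can be sketched as a toy client library: gets record `<key, version>` entries, and a put ships the current context as its dependencies and then collapses the context to just the put itself. All names and the in-process store are assumptions of this sketch.

```python
# Toy sketch of COPS client-library context tracking (illustrative).
import itertools

class CopsClient:
    def __init__(self):
        self.store = {}               # key -> (version, value), toy store
        self.contexts = {}            # ctx_id -> {key: version}
        self._ids = itertools.count(1)

    def createContext(self):
        ctx_id = next(self._ids)
        self.contexts[ctx_id] = {}
        return ctx_id

    def get(self, key, ctx_id):
        version, value = self.store[key]
        self.contexts[ctx_id][key] = version   # record what this session saw
        return value

    def put(self, key, value, ctx_id):
        nearest = dict(self.contexts[ctx_id])  # deps shipped with the write
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        # COPS: the new put is nearer than everything it depends on,
        # so the context collapses to this single entry (transitivity).
        self.contexts[ctx_id] = {key: version}
        return nearest
```

A COPS-GT client would instead keep accumulating `<key, version, deps>` entries, since get transactions may need the full dependency information.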

SLIDE 24

Replication: sender's cluster

  • <bool, ver> = put_after(key, val[, deps], nearest, ver)
  • Write to the local cluster:

ver = null. The primary node is responsible for assigning ver and returning it to the client library. In the local cluster all dependencies are already satisfied.

  • Remote replication:

The primary node asynchronously issues the same put_after to the remote primary nodes, but with the previously assigned ver included.
SLIDE 25

Replication: receiver's cluster

  • bool = dep_check(key, ver)

It is called by the remote node for each of the nearest dependencies, to determine whether that dependency is satisfied in the receiver's cluster. Remember that each key is assigned to a single node – that node will not return from the above call until it has written the required dependency. That dependency is asynchronously replicated between that node and its counterpart in the sender's cluster.

  • dep_check can time out, e.g. because of a node failure. It is then called again, possibly on another node responsible for the key.

SLIDE 26

COPS: Retrieving data

  • <val, ver> = get_by_version(key)

The latest version is always returned (and stored internally).

  • The client library updates the context accordingly: <key, ver> is added

SLIDE 27

COPS-GT: Retrieving data

  • <val, ver, deps> = get_by_version(key, ver)

The default behavior is to get the latest version, but older versions can be retrieved, so get_trans will work properly.

  • The client library updates the context accordingly: <key, ver, deps> is added

SLIDE 28

COPS-GT: Get transaction

  • Motivation: Eve wants to see Alice's photo album:

1. Get the permissions of the album
2. If "public", get the album and show it

Wrong: there is a race condition – after (1), Alice could add naked photos and change the permissions to "private".

  • Fix it with the reverse-causal order of reads:

1. Get the album
2. Get the permissions; if "public", show the album

Wrong: race again – after (1), Alice could remove the naked photos and change the permissions to "public".

SLIDE 29

Get transaction

  • We could provide read/write transactions working on multiple keys
  • But COPS allows independent writes for scalability of replication (no single serialization point)

Dependencies ensure the proper order of visible updates in a remote cluster, but do not limit the order of replication messages.

  • Reads should also use those dependencies: instead of get, COPS-GT has get_trans

SLIDE 30

Get transaction: algorithm

# @param keys    list of keys
# @param ctx_id  context id
# @return values list of values
function get_trans(keys, ctx_id):
    for k in keys                      # Get keys in parallel (first round)
        results[k] = get_by_version(k, LATEST)
    for k in keys                      # Calculate causally correct versions (ccv)
        ccv[k] = max(ccv[k], results[k].vers)
        for dep in results[k].deps
            if dep.key in keys
                ccv[dep.key] = max(ccv[dep.key], dep.vers)
    for k in keys                      # Get needed ccvs in parallel (second round)
        if ccv[k] > results[k].vers
            results[k] = get_by_version(k, ccv[k])
    update_context(results, ctx_id)    # Update the metadata stored in the context
    return extract_values(results)     # Return only the values to the client
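The algorithm above can be replayed with a runnable sketch. It assumes a toy in-memory store that keeps every version of every key; `Store` and `get_trans_round2` are illustrative names, not the real COPS API. The usage replays the album/permissions race: round 1 reads the permissions before Alice's updates, then reads the album after them, and round 2 repairs the mismatch.

```python
# Runnable sketch of get_trans's dependency resolution (illustrative).
class Store:
    def __init__(self):
        self.versions = {}   # key -> list of (version, value, deps)

    def put(self, key, version, value, deps=None):
        self.versions.setdefault(key, []).append((version, value, deps or {}))
        self.versions[key].sort(key=lambda t: t[0])

    def get_by_version(self, key, version=None):
        """Latest version if version is None, else that exact version."""
        if version is None:
            return self.versions[key][-1]
        return next(t for t in self.versions[key] if t[0] == version)

def get_trans_round2(store, results):
    """Second round: given round-1 results {key: (version, value, deps)},
    compute the causally correct version (ccv) of each requested key
    and re-fetch the ones whose retrieved version is too old."""
    ccv = {k: r[0] for k, r in results.items()}
    for _, _, deps in results.values():
        for dep_key, dep_ver in deps.items():
            if dep_key in ccv:
                ccv[dep_key] = max(ccv[dep_key], dep_ver)
    for k in results:
        if ccv[k] > results[k][0]:
            results[k] = store.get_by_version(k, ccv[k])
    return {k: r[1] for k, r in results.items()}

# Replay the album/permissions race:
s = Store()
s.put("acl", 1, "public")
acl_r = s.get_by_version("acl")              # round 1 reads acl: "public"
s.put("acl", 2, "private")                   # Alice flips the permissions...
s.put("album", 3, "new-photos", {"acl": 2})  # ...and updates the album
album_r = s.get_by_version("album")          # round 1 reads album (deps: acl v2)
out = get_trans_round2(s, {"acl": acl_r, "album": album_r})
```

The album read carries the dependency acl v2, which is newer than the acl v1 read in round 1, so round 2 re-fetches acl at version 2 and the client sees the consistent pair ("private", new album) rather than the dangerous ("public", new album).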

SLIDE 31

Get transaction: properties

  • There are two rounds:

1. Get the latest versions of all keys
2. Get specific versions of some keys

The second round happens if, during the first round, some to-be-read keys were concurrently updated and depend on newer versions of other already-retrieved keys.

  • The second round must be able to get non-latest versions of keys. Otherwise the number of rounds could be infinite.
  • Both rounds read data only from the local cluster. If we read A (so it is available) and it depends on B, then B must also be available (because of causality).
  • The retrieved versions may not be the newest, but they are consistent.
SLIDE 32

Garbage collection: COPS-GT

  • COPS-GT must keep old versions of keys to support get_trans

When a key is updated during a running get_trans, its old version(s) must be kept until that transaction ends. After that, the old version(s) may be deleted.

  • We limit the running time of get_trans (the default is 5 seconds), so versions older than that can be deleted. In case of a timeout, the client library restarts the operation – get_trans will read new versions of the keys.

SLIDE 33

Garbage collection: COPS-GT

  • COPS-GT must keep all dependencies of keys to support get_trans

Once some version Kv has been written to all clusters (plus the running time of get_trans), we know that all its dependencies have also been written. When that happens, the dependency list of Kv can be deleted.

  • In case of a long partition between clusters, dependency lists will consume a large amount of space.

SLIDE 34

Evaluation

  • Variables on the diagrams:

– put:get ratio captures the average relation between the number of put and get operations
– variance is the chance that different clients access the same keys (a larger value means more interaction between clients)

The above values have a direct impact on the size of the dependency lists each node must keep and process when doing get/put operations. The size of those lists affects performance.

  • LOG – COPS without dependency tracking, simulating the single-node-per-cluster log exchange method (which is causal but not scalable)

SLIDE 35

Evaluation: COPS vs COPS-GT

Note the put:get ratio in the legend.

COPS and COPS-GT offer similar throughput when the delay between operations is long enough.

SLIDE 36

Evaluation: COPS-GT

We already know that dependency lists can be garbage-collected after replication to all clusters. That is why long delays between operations help reduce the size of the lists associated (and replicated) with each version.

SLIDE 37

Evaluation: COPS vs COPS-GT

Note the variance values in the legend.

COPS and COPS-GT offer similar throughput when mostly gets are issued (read-heavy workloads).

SLIDE 38

Evaluation: COPS-GT

As the put:get ratio drops, dependency list size first increases – each get inherits new dependencies on values put by other clients. After some point, puts are so rare that their dependencies have time to get fully replicated and can be excluded from the lists. That is why list size decreases again.

SLIDE 39

Evaluation: scalability

In the single-node-per-cluster setting, COPS is as good as the log exchange solution. COPS's strength is that it scales well (almost linearly) with the number of nodes per cluster. COPS-GT offers almost the same throughput as COPS in all expected real-world workloads (the default parameters are rather artificial).

SLIDE 40

Other systems

  • Eventual: Dynamo, Voldemort, Cassandra

They provide ALPS, but not causality

  • Causal: Bayou, TACT, PRACTI

Limited to single-node clusters, so they are just ALP

  • Transactional: R*, Walter

– Distributed DBs; they require two-phase, wide-area locks. They are C, but not A, L, P, or S (more or less)

SLIDE 41

Summary

  • COPS can provide ALPS properties to today's large-scale distributed systems
  • COPS is causal+, a property invaluable for supporting complex (feature-rich, not focused on resolving dependencies/conflicts) application logic
  • COPS is causal and scales well across multiple nodes
  • COPS-GT is even more consistent, at the cost of only a small decrease in performance
SLIDE 42

Questions & Answers

?