SLIDE 1

COPS

Scalable Causal Consistency for Wide-Area Storage

A presentation by

Maxymilian Śmiech

But here are those who did the work: Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen

SLIDE 2

What it will be about

  • 1. Problem definition
  • 2. Idea of the solution
  • 3. Implementation overview
  • 4. Performance analysis
  • 5. Previous work
  • 6. Summary
SLIDE 3

The ultimate goal

  • A distributed storage system should:

– Give a consistent view of data
– Always be available
– Perform well under network partitions

SLIDE 4

Unfortunately: CAP Theorem

  • It is not possible to have a strongly consistent (linearizable), always available system with partition tolerance

  • In practice we sacrifice consistency
SLIDE 5

Over the years...

  • Weak consistency was sufficient in the past

Early search engines – synchronization was not critical

  • Now we have distributed systems with complex dependencies

Modern social networks – inconsistent data leads to user frustration

SLIDE 6

What is worth fighting for

  • Availability
  • low Latency
  • Partition tolerance
  • high Scalability

We can't have all of C, A, and P. Instead we trade strong consistency for the ability to easily achieve low latency and high scalability: CAP → ALPS. But we don't want to give up all of C – a single view of data helps in writing simple software.

"Always on" experience

SLIDE 7

Solution – COPS data store

  • Clusters of Order-Preserving Servers
  • It has causal+ consistency:

– Causal consistency
– Convergent conflict handling

  • Causal+ is the strongest consistency model achievable under ALPS constraints

SLIDE 8

Causal consistency

  • Ensures dependencies between data are respected

No need to handle them at the application level

  • Case study: Alice adds a photo to an album:

1. Save the uploaded photo (and its metadata)
2. Add the photo (a reference to it) to the album

Now Bob opens the album page:

1. Read the album data (a list of photo references)
2. For each photo reference, put a link on the page
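Alice's two writes above can be sketched as a toy dependency tracker: every operation earlier in an execution thread causally precedes the next one. The `Context` class and its methods are illustrative, not the real COPS library.

```python
# Toy sketch: causal dependencies arising from program order
# in a single execution thread (illustrative, not the COPS API).
class Context:
    def __init__(self):
        self.ops = []          # operations in program order

    def put(self, key, value):
        # every earlier operation in this thread is a (transitive) dependency
        deps = [k for k, _ in self.ops]
        self.ops.append((key, value))
        return deps

ctx = Context()
ctx.put("photo:42", "bytes...")               # 1. save the photo -> no deps
deps = ctx.put("album:alice", ["photo:42"])   # 2. add it to the album
# deps now contains "photo:42": the album update causally depends on
# the photo, so no cluster may show the new album before the photo.
```
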

SLIDE 9

Causal vs Eventual

  • In an eventual data store, a cluster can return updates "out of order". Therefore the application server must ensure that Bob is not affected by references to photos not yet present in "his" cluster. Otherwise he may get a "404" error!
  • At the application level we must check whether the data store has all the photos referenced from the album; if not, we don't render broken links on the page. Each time the album is viewed, we check which photos are available. We shouldn't have to think that way!

SLIDE 10

Causal vs Eventual

  • We switch to causal consistency

Now each cluster checks whether it has received the photo. If not, it returns the old album info, without the dangling photo reference. The old album contents are returned even if an updated version is available. A cluster delays updates received from a remote cluster until all of their dependencies are satisfied.

Result: when the data store returns the updated album, the application can be sure that the new photo is also available.

SLIDE 11

Convergent conflict handling

  • Every cluster uses the same handler function to resolve conflicts between two values assigned to the same key.
  • We require that the handler is associative and commutative.
  • That ensures convergence to the same final value, independent of conflict resolution order.
  • The handler can be provided by the application. It can execute some processing, or just "add" both possibilities, store them as a new value, and let the application handle the conflict later.
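A minimal sketch of the default last-writer-wins handler, assuming versions are (Lamport timestamp, node id) pairs as described later in the talk. `lww_handler` is an illustrative name; the point is that the function is commutative and associative, so every cluster converges to the same winner.

```python
# Default convergent conflict handler: last-writer-wins over
# (Lamport timestamp, node_id) versions. Illustrative sketch.
def lww_handler(a, b):
    """a and b are (version, value) pairs, version = (counter, node_id).
    Returns the pair with the larger version. Tuple comparison gives a
    total order, so the result is independent of resolution order."""
    return a if a[0] > b[0] else b
```

Because the handler is deterministic over a total order, resolving `{a, b, c}` in any grouping or sequence yields the same final value.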

SLIDE 12

Design details

  • Two versions:

– COPS

Reads/writes single pieces of data. Reads always return values consistent with causal+.

– COPS-GT

Get transactions – the ability to retrieve a consistent set of values.

  • They differ in stored metadata – a single system must consist of same-type clusters.

SLIDE 13

Assumptions

  • A small number of big datacenters
  • Each datacenter contains application (front-end) servers talking to a local storage cluster
  • Each cluster keeps a copy of all data and is contained entirely in a single datacenter
  • Datacenters are good enough to provide low latency of local operations and resistance to partitioning (each cluster is linearizable)

SLIDE 14

Expectations

  • COPS requires powerful datacenters, so what does it give in return? Asynchronous replication in the background:

– Data is constantly exchanged with other datacenters without blocking current operations
– Data always respects causal+ properties...

...even if any of the datacenters fails, dependencies are preserved

SLIDE 15

COPS (abstract) interface

  • Nothing more than a simple key-value store:

value = get(key)
put(key, value)

  • Execution thread – a stateful "session" used by a client (application server) when performing operations on the data store
  • All communication between threads happens through COPS (so dependencies can be tracked)

SLIDE 16

Causality relation

  • If a and b happen in a single execution thread and a happens before b, then a → b
  • If a is put(k,v) and b is get(k) which returns the value put by a, then a → b
  • a → b and b → c implies a → c
  • If a → b and both are puts, we say that b depends on a
  • If a ↛ b and b ↛ a, then a and b are concurrent. They are unrelated and can be replicated independently.

But if such a is put(k,v) and b is put(k,w), then a and b are in conflict. It must be resolved.

SLIDE 17

Causality relation: example

There should be more arrows, but they are implied by those shown above

SLIDE 18

Architecture

Node – part of a linearizable key-value store, with additional extensions to support replication in a causal+ way.
Application context – tracks dependencies in an execution thread.

SLIDE 19

Dividing the keyspace

  • Each cluster has a full copy of the key-value set

A cluster can use consistent hashing or other methods of dividing the keyspace between its nodes

  • A cluster can use chain replication for fault tolerance. For each key there is a single primary node per cluster. Only the primary nodes of corresponding keys exchange messages between clusters.
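As a sketch of the partitioning idea, here is a minimal consistent-hash ring mapping keys to primary nodes. This is illustrative only: COPS can use consistent hashing or any other partitioning scheme, and `HashRing`, its virtual-node count, and the MD5 choice are all assumptions of this example.

```python
# Minimal consistent-hashing sketch: map each key to a primary node.
# Illustrative; not the actual COPS partitioning code.
import hashlib
from bisect import bisect

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Several virtual points per node spread keys evenly and mean
        # only ~1/N of keys move when a node joins or leaves.
        self.ring = sorted(
            (self._hash(f"{n}-{i}"), n) for n in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def primary(self, key):
        """Primary node for key: first ring point clockwise of its hash."""
        h = self._hash(key)
        idx = bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["n1", "n2", "n3"])
node = ring.primary("album:alice")   # deterministic for a fixed node set
```
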

SLIDE 20

Library interface

  • ctx_id = createContext()
  • bool = deleteContext(ctx_id)
  • bool = put(key, value, ctx_id)
  • value = get(key, ctx_id)

[In COPS]

  • values = get_trans(keys, ctx_id)

[In COPS-GT]

  • ctx_id is used to track a specific context when a single client of COPS (an application server) handles multiple user sessions

SLIDE 21

Lamport timestamp

  • Used to assign a version to <key, value> after each put(key, val). It respects causal dependencies (a larger timestamp means a later update).
  • Basically: the counter is incremented before each local update and is sent with replication messages. The receiver sets its counter to one plus the maximum of the received value and its own counter.
  • Combined with a unique node identifier, it allows implementing the default convergent conflict handler (we get a global order on updates to the same key, so we just let the last writer win).
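The counter rules above can be sketched as a minimal Lamport clock. The class and method names are illustrative, not the COPS implementation.

```python
# Minimal Lamport clock sketch (illustrative, not the COPS code).
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id   # unique node id, used as a tie-breaker
        self.counter = 0

    def tick(self):
        """Advance before a local update; return the new version,
        a (counter, node_id) pair with a global total order."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        """On a replication message: maximum-plus-one merge rule."""
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```

Ordering the (counter, node_id) pairs lexicographically gives the global order on updates to a key that the default last-writer-wins handler relies on.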

SLIDE 22

Nearest dependencies

  • Used to limit the size of the metadata kept by the client library and the number of checks done by nodes

  • COPS-GT must keep all dependencies
SLIDE 23

Dependencies

  • The context keeps <key, version, [deps]> entries

version increases with causally-related puts to key

  • val = get(key) adds <key, version, [deps]> to the context (the application saw val, so its next actions may be based on it)
  • put(key, val) uses the current context as the set of dependencies for key

COPS: afterwards, it clears the current context and adds a single <key, ver> entry for that put. This is possible in COPS because of the transitivity of dependencies – only the nearest dependencies are needed, and this put is nearer than anything before it.

COPS-GT cannot remove anything, because it must be able to support get transactions.
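The context bookkeeping described above (COPS mode, not COPS-GT) can be sketched as a toy client library: gets record `<key, version>` entries, and a put ships the current context as its dependencies and then collapses the context to just the put itself. All names and the in-process store are assumptions of this sketch.

```python
# Toy sketch of COPS client-library context tracking (illustrative).
import itertools

class CopsClient:
    def __init__(self):
        self.store = {}               # key -> (version, value), toy store
        self.contexts = {}            # ctx_id -> {key: version}
        self._ids = itertools.count(1)

    def createContext(self):
        ctx_id = next(self._ids)
        self.contexts[ctx_id] = {}
        return ctx_id

    def get(self, key, ctx_id):
        version, value = self.store[key]
        self.contexts[ctx_id][key] = version   # record what this session saw
        return value

    def put(self, key, value, ctx_id):
        nearest = dict(self.contexts[ctx_id])  # deps shipped with the write
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        # COPS: the new put is nearer than everything it depends on,
        # so the context collapses to this single entry (transitivity).
        self.contexts[ctx_id] = {key: version}
        return nearest
```

A COPS-GT client would instead keep accumulating `<key, version, deps>` entries, since get transactions may need the full dependency information.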

SLIDE 24

Replication: sender's cluster

  • <bool, ver> = put_after(key, val[, deps], nearest, ver)
  • Write to the local cluster:

ver = null. The primary node is responsible for assigning ver and returning it to the client library. In the local cluster all dependencies are already satisfied.

  • Remote replication:

The primary node asynchronously issues the same put_after to the remote primary nodes, but with the previously assigned ver included.
SLIDE 25

Replication: receiver's cluster

  • bool = dep_check(key, ver)

It is called by the remote node for each of the nearest dependencies, to determine whether that dependency is satisfied in the receiver's cluster. Remember that each key is assigned to a single node – that node will not return from the above call until it has written the required dependency. That dependency is asynchronously replicated between that node and its counterpart in the sender's cluster.

  • dep_check can time out, e.g. because of a node failure. It is then called again, possibly on another node responsible for the key.

SLIDE 26

COPS: Retrieving data

  • <val, ver> = get_by_version(key)

The latest version is always returned (and stored internally).

  • The client library updates the context accordingly: <key, ver> is added

SLIDE 27

COPS-GT: Retrieving data

  • <val, ver, deps> = get_by_version(key, ver)

The default behavior is to get the latest version, but older versions can be retrieved, so get_trans will work properly.

  • The client library updates the context accordingly: <key, ver, deps> is added

SLIDE 28

COPS-GT: Get transaction

  • Motivation: Eve wants to see Alice's photo album:

1. Get the permissions of the album
2. If "public", get the album and show it

Wrong: there is a race condition – after (1), Alice could add naked photos and change the permissions to "private".

  • Fix it with the reverse-causal order of reads:

1. Get the album
2. Get the permissions; if "public", show the album

Wrong: race again – after (1), Alice could remove the naked photos and change the permissions to "public".

SLIDE 29

Get transaction

  • We could provide read/write transactions working on multiple keys
  • But COPS allows independent writes for scalability of replication (no single serialization point)

Dependencies ensure the proper order of visible updates in a remote cluster, but do not limit the order of replication messages.

  • Reads should also use those dependencies: instead of get, COPS-GT has get_trans

SLIDE 30

Get transaction: algorithm

# @param keys    list of keys
# @param ctx_id  context id
# @return values list of values
function get_trans(keys, ctx_id):
    for k in keys                      # Get keys in parallel (first round)
        results[k] = get_by_version(k, LATEST)
    for k in keys                      # Calculate causally correct versions (ccv)
        ccv[k] = max(ccv[k], results[k].vers)
        for dep in results[k].deps
            if dep.key in keys
                ccv[dep.key] = max(ccv[dep.key], dep.vers)
    for k in keys                      # Get needed ccvs in parallel (second round)
        if ccv[k] > results[k].vers
            results[k] = get_by_version(k, ccv[k])
    update_context(results, ctx_id)    # Update the metadata stored in the context
    return extract_values(results)     # Return only the values to the client
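The algorithm above can be replayed with a runnable sketch. It assumes a toy in-memory store that keeps every version of every key; `Store` and `get_trans_round2` are illustrative names, not the real COPS API. The usage replays the album/permissions race: round 1 reads the permissions before Alice's updates, then reads the album after them, and round 2 repairs the mismatch.

```python
# Runnable sketch of get_trans's dependency resolution (illustrative).
class Store:
    def __init__(self):
        self.versions = {}   # key -> list of (version, value, deps)

    def put(self, key, version, value, deps=None):
        self.versions.setdefault(key, []).append((version, value, deps or {}))
        self.versions[key].sort(key=lambda t: t[0])

    def get_by_version(self, key, version=None):
        """Latest version if version is None, else that exact version."""
        if version is None:
            return self.versions[key][-1]
        return next(t for t in self.versions[key] if t[0] == version)

def get_trans_round2(store, results):
    """Second round: given round-1 results {key: (version, value, deps)},
    compute the causally correct version (ccv) of each requested key
    and re-fetch the ones whose retrieved version is too old."""
    ccv = {k: r[0] for k, r in results.items()}
    for _, _, deps in results.values():
        for dep_key, dep_ver in deps.items():
            if dep_key in ccv:
                ccv[dep_key] = max(ccv[dep_key], dep_ver)
    for k in results:
        if ccv[k] > results[k][0]:
            results[k] = store.get_by_version(k, ccv[k])
    return {k: r[1] for k, r in results.items()}

# Replay the album/permissions race:
s = Store()
s.put("acl", 1, "public")
acl_r = s.get_by_version("acl")              # round 1 reads acl: "public"
s.put("acl", 2, "private")                   # Alice flips the permissions...
s.put("album", 3, "new-photos", {"acl": 2})  # ...and updates the album
album_r = s.get_by_version("album")          # round 1 reads album (deps: acl v2)
out = get_trans_round2(s, {"acl": acl_r, "album": album_r})
```

The album read carries the dependency acl v2, which is newer than the acl v1 read in round 1, so round 2 re-fetches acl at version 2 and the client sees the consistent pair ("private", new album) rather than the dangerous ("public", new album).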

SLIDE 31

Get transaction: properties

  • There are two rounds:

1. Get the latest versions of all keys
2. Get specific versions of some keys

The second round happens if, during the first round, some to-be-read keys were concurrently updated and depend on newer versions of other already-retrieved keys.

  • The second round must be able to get non-latest versions of keys. Otherwise the number of rounds could be infinite.
  • Both rounds read data only from the local cluster. If we read A (so it is available) and it depends on B, then B must also be available (because of causality).
  • The retrieved versions may not be the newest, but they are consistent.
SLIDE 32

Garbage collection: COPS-GT

  • COPS-GT must keep old versions of keys to support get_trans

When a key is updated during a running get_trans, its old version(s) must be kept until that transaction ends. After that, the old version(s) may be deleted.

  • We limit the running time of get_trans (the default is 5 seconds), so versions older than that can be deleted. In case of a timeout, the client library restarts the operation – get_trans will read new versions of the keys.

SLIDE 33

Garbage collection: COPS-GT

  • COPS-GT must keep all dependencies of keys to support get_trans

Once some version Kv has been written to all clusters (plus the running time of get_trans), we know that all its dependencies have also been written. When that happens, the dependency list of Kv can be deleted.

  • In case of a long partition between clusters, dependency lists will consume a large amount of space.

SLIDE 34

Evaluation

  • Variables on the diagrams:

– put:get ratio captures the average relation between the number of put and get operations
– variance is the chance that different clients access the same keys (a larger value means more interaction between clients)

The above values have a direct impact on the size of the dependency lists each node must keep and process when doing get/put operations. The size of those lists affects performance.

  • LOG – COPS without dependency tracking, simulating the single-node-per-cluster log exchange method (which is causal but not scalable)

SLIDE 35

Evaluation: COPS vs COPS-GT

Note the put:get ratio in the legend.

COPS and COPS-GT offer similar throughput when the delay between operations is long enough.

SLIDE 36

Evaluation: COPS-GT

We already know that dependency lists can be garbage-collected after replication to all clusters. That is why long delays between operations help reduce the size of the lists associated (and replicated) with each version.

SLIDE 37

Evaluation: COPS vs COPS-GT

Note the variance values in the legend.

COPS and COPS-GT offer similar throughput when mostly gets are issued (read-heavy workloads).

SLIDE 38

Evaluation: COPS-GT

As the put:get ratio drops, dependency list size first increases – each get inherits new dependencies on values put by other clients. After some point, puts are so rare that their dependencies have time to get fully replicated and can be excluded from the lists. That is why list size decreases again.

SLIDE 39

Evaluation: scalability

In the single-node-per-cluster setting, COPS is as good as the log exchange solution. COPS's strength is that it scales well (almost linearly) with the number of nodes per cluster. COPS-GT offers almost the same throughput as COPS in all expected real-world workloads (the default parameters are rather artificial).

SLIDE 40

Other systems

  • Eventual: Dynamo, Voldemort, Cassandra

They provide ALPS, but not causality

  • Causal: Bayou, TACT, PRACTI

Limited to single-node clusters, so they are just ALP

  • Transactional: R*, Walter

– Distributed DBs; they require two-phase, wide-area locks. They are C, but not A, L, P, or S (more or less)

SLIDE 41

Summary

  • COPS can provide ALPS properties to today's large-scale distributed systems
  • COPS is causal+, a property invaluable for supporting complex (feature-rich, not focused on resolving dependencies/conflicts) application logic
  • COPS is causal and scales well across multiple nodes
  • COPS-GT is even more consistent, at the cost of only a small decrease in performance
SLIDE 42

Questions & Answers

?