Distributed Databases Instructor: Matei Zaharia cs245.stanford.edu

Outline
Replication strategies
Partitioning strategies
AC & 2PC
CAP
Avoiding coordination
Parallel query execution
CS 245

Atomic Commitment
Informally: either all participants commit a transaction, or none do
“participants” = partitions involved in a given transaction

So, What’s Hard?
All the problems of consensus…
…plus, if any node votes to abort, all must decide to abort
» In consensus, simply need agreement on “some” value

Two-Phase Commit
Canonical protocol for atomic commitment (developed 1976-1978)
Basis for most fancier protocols
Widely used in practice
Use a transaction coordinator
» Usually the client, but not always!

Two Phase Commit (2PC)
1. Transaction coordinator sends prepare message to each participating node
2. Each participating node responds to coordinator with prepared or no
3. If coordinator receives all prepared:
» Broadcast commit
4. If coordinator receives any no:
» Broadcast abort
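The four steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real implementation: `ToyParticipant`, its fixed `vote`, and the `finish` method are hypothetical names invented here, and there is no networking, logging, or failure handling.

```python
class ToyParticipant:
    """In-memory stand-in for a partition; its vote is fixed up front (assumption)."""

    def __init__(self, name, vote="prepared"):
        self.name = name
        self.vote = vote      # "prepared" or "no"
        self.outcome = None   # set to "commit" or "abort" in phase 2

    def prepare(self, txn_id):
        # Step 2: respond to the coordinator with "prepared" or "no".
        return self.vote

    def finish(self, txn_id, decision):
        # Receive the coordinator's broadcast decision.
        self.outcome = decision


def two_phase_commit(txn_id, participants):
    # Phase 1 (steps 1-2): send prepare to every participant, collect votes.
    votes = [p.prepare(txn_id) for p in participants]
    # Phase 2 (steps 3-4): commit only if all voted "prepared"; any "no" aborts.
    decision = "commit" if all(v == "prepared" for v in votes) else "abort"
    for p in participants:
        p.finish(txn_id, decision)
    return decision
```

Calling `two_phase_commit("T1", [ToyParticipant("p1"), ToyParticipant("p2", vote="no")])` yields an abort for every participant, because a single `no` vote is enough to force it.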

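Informal Example
[Figure: Matei acts as coordinator for Alice, Bob, and PizzaSpot. Matei asks Alice and Bob “Pizza tonight?” and each answers “Sure”. Matei asks PizzaSpot “Got a table for 3 tonight?” and PizzaSpot answers “Yes we do”. Matei then tells PizzaSpot “I’ll book it” and sends “Confirmed” to Alice and Bob.]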

Case 1: Commit
[Figure omitted; credit: UW CSE545]

Case 2: Abort
[Figure omitted; credit: UW CSE545]

2PC + Validation
Participants perform validation upon receipt of prepare message
Validation essentially blocks between prepare and commit message

2PC + 2PL
Traditionally: run 2PC at commit time
» i.e., perform locking as usual, then run 2PC to have all participants agree that the transaction will commit
Under strict 2PL, run 2PC before unlocking the write locks

2PC + Logging
Log records must be flushed to disk on each participant before it replies to prepare
» The participant should log how it wants to respond + data needed if it wants to commit
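The flush-before-reply rule can be sketched as follows. This is a toy participant under stated assumptions: `LoggingParticipant`, the text log format, and the `pending` buffer are hypothetical, and `os.fsync` is used as the way to ask the operating system to push the log to stable storage.

```python
import os


class LoggingParticipant:
    """Toy participant that makes its log durable before voting (illustrative only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = []  # log records not yet forced to disk

    def write(self, txn_id, obj, value):
        # Buffer a data record; it must reach disk before we vote.
        self.pending.append(f"<{txn_id}, {obj}, {value}>")

    def _force(self, records):
        with open(self.log_path, "a", encoding="utf-8") as log:
            for record in records:
                log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to write the log to stable storage

    def prepare(self, txn_id, can_commit=True):
        vote = "ready" if can_commit else "no"
        # Data records and the vote are flushed together, before replying.
        self._force(self.pending + [f"<{txn_id}, {vote}>"])
        self.pending = []
        return vote  # the reply is produced only after the flush returns

    def commit(self, txn_id):
        self._force([f"<{txn_id}, commit>"])
        return "done"
```

If the participant crashes after returning `ready`, the log already holds both its vote and the data it needs to commit, which is what lets it honor a later commit message.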

2PC + Logging Example
[Figure, step 1: the coordinator sends read, write, etc. to the participants, which append log records]
Participant 1 log: <T1, Obj1, …> <T1, Obj2, …>
Participant 2 log: <T1, Obj3, …> <T1, Obj4, …>

2PC + Logging Example
[Figure, step 2: the coordinator sends prepare to both participants; each replies ready]
Participant 1 log: <T1, Obj1, …> <T1, Obj2, …> <T1, ready>
Coordinator log: <T1, commit>
Participant 2 log: <T1, Obj3, …> <T1, Obj4, …> <T1, ready>

2PC + Logging Example
[Figure, step 3: the coordinator sends commit to both participants; each replies done]
Participant 1 log: <T1, Obj1, …> <T1, Obj2, …> <T1, ready> <T1, commit>
Coordinator log: <T1, commit>
Participant 2 log: <T1, Obj3, …> <T1, Obj4, …> <T1, ready> <T1, commit>

Optimizations Galore
Participants can send prepared messages to each other:
» Can commit without the client
» Requires O(P²) messages
Piggyback transaction’s last command on prepare message
2PL: piggyback lock “unlock” commands on commit / abort message

What Could Go Wrong?
[Figure: the coordinator sends PREPARE to three participants]

What Could Go Wrong?
[Figure: two of the three participants reply PREPARED to the coordinator]
What if we don’t hear back?

Case 1: Participant Unavailable
We don’t hear back from a participant
Coordinator can still decide to abort
» Coordinator makes the final call!
Participant comes back online?
» Will receive the abort message
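A small sketch of this case, assuming a missing reply surfaces as a `TimeoutError` from the prepare call. The function names `collect_decision`, `healthy`, and `unreachable` are hypothetical, and participants are modeled as plain callables rather than remote nodes.

```python
def collect_decision(txn_id, participants):
    """Return "commit" only if every participant answers "prepared"."""
    for prepare in participants:
        try:
            vote = prepare(txn_id)
        except TimeoutError:
            # No reply in time: the coordinator cannot tell a slow node from
            # a failed one, so it makes the final call and aborts.
            return "abort"
        if vote != "prepared":
            return "abort"
    return "commit"


def healthy(txn_id):
    # Participant that answers the prepare message.
    return "prepared"


def unreachable(txn_id):
    # Participant whose reply never arrives (simulated as a timeout).
    raise TimeoutError("no reply to prepare for " + txn_id)
```

The coordinator never blocks here: silence is treated the same as a `no` vote, and the participant learns of the abort when it comes back online.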

What Could Go Wrong?
[Figure: all three participants reply PREPARED to the coordinator]
Coordinator does not reply!

Case 2: Coordinator Unavailable
Participants cannot make progress
But: can agree to elect a new coordinator, never listen to the old one (using consensus)
» Old coordinator comes back? Overruled by participants, who reject its messages
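One common way to "never listen to the old one" is to number coordinators with an increasing epoch and have participants reject anything from a lower epoch. The sketch below shows only that rejection step; the epoch scheme and the names `FencedParticipant`, `install_coordinator`, and `on_decision` are assumptions made here, and the consensus round that picks the new coordinator is not modeled.

```python
class FencedParticipant:
    """Toy participant that ignores coordinators from older epochs (assumed scheme)."""

    def __init__(self):
        self.current_epoch = 0
        self.decisions = {}

    def install_coordinator(self, epoch):
        # Called once consensus has chosen a new coordinator (not modeled here).
        self.current_epoch = max(self.current_epoch, epoch)

    def on_decision(self, epoch, txn_id, decision):
        if epoch < self.current_epoch:
            return False  # stale coordinator: reject its message
        self.decisions[txn_id] = decision
        return True
```

With this check in place, an old coordinator that comes back and resends a decision from epoch 1 is simply ignored once the participants have moved to epoch 2.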

What Could Go Wrong?
[Figure: two of the three participants have replied PREPARED]
Coordinator does not reply!
No contact with third participant!

Case 3: Coordinator and Participant Unavailable
Worst-case scenario:
» Unavailable/unreachable participant voted prepared
» Coordinator hears back prepared from all, broadcasts commit
» Unavailable/unreachable participant commits
Rest of participants must wait!!!

Other Applications of 2PC
The “participants” can be any entities with distinct failure modes; for example:
» Add a new user to database and queue a request to validate their email
» Book a flight from SFO -> JFK on United and a flight from JFK -> LON on British Airways
» Check whether Bob is in town, cancel my hotel room, and ask Bob to stay at his place

Coordination is Bad News
Every atomic commitment protocol is blocking (i.e., may stall) in the presence of:
» Asynchronous network behavior (e.g., unbounded delays)
  • Cannot distinguish between delay and failure
» Failing nodes
  • If nodes never failed, could just wait
Cool: actual theorem!

Outline
Replication strategies
Partitioning strategies
AC & 2PC
CAP
Avoiding coordination
Parallel processing

Eric Brewer

Asynchronous Network Model
Messages can be arbitrarily delayed
Can’t distinguish between delayed messages and failed nodes in a finite amount of time

CAP Theorem
In an asynchronous network, a distributed database can either:
» guarantee a response from any replica in a finite amount of time (“availability”) OR
» guarantee arbitrary “consistency” criteria/constraints about data
but not both

CAP Theorem
Choose either:
» Consistency and “Partition Tolerance”
» Availability and “Partition Tolerance”
Example consistency criteria:
» Exactly one key can have value “Matei”
“CAP” is a reminder:
» No free lunch for distributed systems

Why CAP is Important
Pithy reminder: “consistency” (serializability, various integrity constraints) is expensive!
» Costs us the ability to provide “always on” operation (availability)
» Requires expensive coordination (synchronous communication) even when we don’t have failures

Let’s Talk About Coordination
If we’re “AP”, then we don’t have to talk even when we can!
If we’re “CP”, then we have to talk all the time
How fast can we send messages?
» Planet Earth: 144ms RTT
  • (77ms if we drill through center of earth)
» Einstein!

Multi-Datacenter Transactions
Message delays often much worse than speed of light (due to routing)
44ms apart? Maximum 22 conflicting transactions per second
» Of course, no conflicts, no problem!
» Can scale out
Pain point for many systems
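The figure of 22 per second follows from simple arithmetic, assuming conflicting transactions must run one after another and each one costs a single 44 ms coordination delay. The helper name `max_conflicting_txns_per_second` and its parameter `delay_ms` are hypothetical.

```python
import math


def max_conflicting_txns_per_second(delay_ms):
    """Upper bound when conflicting transactions are serialized and each
    one waits out a single coordination delay of `delay_ms` milliseconds."""
    return math.floor(1000 / delay_ms)


# 1000 ms / 44 ms is about 22.7, so at most 22 whole transactions fit in a second.
limit = max_conflicting_txns_per_second(44)
```

Non-conflicting transactions are not subject to this bound, which is why the slide notes that they can scale out.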

Do We Have to Coordinate?
Is it possible to achieve some forms of “correctness” without coordination?

Do We Have to Coordinate?
Example: no user in DB has address=NULL
» If no replica assigns address=NULL on their own, then NULL will never appear in the DB!
Whole topic of research!
» Key finding: most applications have a few points where they need coordination, but many operations do not

So Why Bother with Serializability?
For arbitrary integrity constraints, non-serializable execution can break constraints
Serializability: just look at reads, writes
To get “coordination-free execution”:
» Must look at application semantics
» Can be hard to get right!
» Strategy: start coordinated, then relax

Punchlines:
Serializability has a provable cost to latency, availability, scalability (if there are conflicts)
We can avoid this penalty if we are willing to look at our application and our application does not require coordination
» Major topic of ongoing research

Outline
Replication strategies
Partitioning strategies
AC & 2PC
CAP
Avoiding coordination
Parallel query execution

Avoiding Coordination
Several key techniques; e.g. BASE ideas
» Partition data so that most transactions are local to one partition
» Tolerate out-of-date data (eventual consistency):
  • Caches
  • Weaker isolation levels
  • Helpful ideas: idempotence, commutativity
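To see why idempotence and commutativity help, consider replicas that receive the same updates late, reordered, or more than once. The sketch below is illustrative only: the `Balance` class and the use of unique update ids to detect redelivery are assumptions made here.

```python
class Balance:
    """Replica state updated by deposits that may arrive reordered or twice."""

    def __init__(self):
        self.total = 0
        self.seen = set()  # ids of updates already applied

    def apply(self, update_id, amount):
        if update_id in self.seen:
            return  # idempotence: a redelivered update has no further effect
        self.seen.add(update_id)
        self.total += amount  # commutativity: addition is order-independent


updates = [("u1", 10), ("u2", 25)]

replica_a = Balance()
for update_id, amount in updates:
    replica_a.apply(update_id, amount)

replica_b = Balance()
for update_id, amount in reversed(updates + updates):  # reordered and duplicated
    replica_b.apply(update_id, amount)
```

Both replicas end with the same total even though they never coordinated on delivery order, so out-of-date copies converge once all updates arrive.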

Example from BASE Paper
Constraint: each user’s amt_sold and amt_bought is sum of their transactions
ACID Approach: to add a transaction, use 2PC to update transactions table + records for buyer, seller
One BASE approach: to add a transaction, write to transactions table + a persistent queue of updates to be applied later
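A rough sketch of the queue-based approach, with in-memory structures standing in for the real tables and the persistent queue. The `Marketplace` class and its method names are hypothetical; only `amt_sold` and `amt_bought` come from the slide.

```python
from collections import deque


class Marketplace:
    """Toy model: a list for the transactions table, a deque for the queue."""

    def __init__(self):
        self.transactions = []       # stand-in for the transactions table
        self.users = {}              # user -> {"amt_sold": ..., "amt_bought": ...}
        self.update_queue = deque()  # stand-in for a persistent queue

    def add_transaction(self, seller, buyer, amount):
        # Record the transaction and enqueue the user updates; no 2PC needed.
        self.transactions.append((seller, buyer, amount))
        self.update_queue.append((seller, "amt_sold", amount))
        self.update_queue.append((buyer, "amt_bought", amount))

    def apply_pending_updates(self):
        # Run later, e.g. by a background worker; user totals lag until then.
        while self.update_queue:
            user, field, amount = self.update_queue.popleft()
            row = self.users.setdefault(user, {"amt_sold": 0, "amt_bought": 0})
            row[field] += amount
```

Between `add_transaction` and `apply_pending_updates` the constraint is temporarily violated; it holds again once the queue is drained.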

Example from BASE Paper
Constraint: each user’s amt_sold and amt_bought is sum of their transactions
ACID Approach: to add a transaction, use 2PC to update transactions table + records for buyer, seller
Another BASE approach: write new transactions to the transactions table and use a periodic batch job to fill in the users table
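The batch-job variant can be sketched as a function that recomputes the users table from scratch. The name `rebuild_user_totals` and the tuple layout `(seller, buyer, amount)` are assumptions made for illustration.

```python
def rebuild_user_totals(transactions):
    """Periodic batch job: derive every user's totals from the transactions table."""
    totals = {}
    for seller, buyer, amount in transactions:
        totals.setdefault(seller, {"amt_sold": 0, "amt_bought": 0})["amt_sold"] += amount
        totals.setdefault(buyer, {"amt_sold": 0, "amt_bought": 0})["amt_bought"] += amount
    return totals


# Hypothetical contents of the transactions table.
transactions = [("alice", "bob", 10), ("bob", "carol", 4)]
users = rebuild_user_totals(transactions)
```

Because the job recomputes from the source of truth, running it twice gives the same result, so a failed or repeated run does no harm.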
