
Data Storage Revolution Relational Databases Object Storage - PowerPoint PPT Presentation


  1. Data Storage Revolution • Relational Databases • Object Storage (put/get) – Dynamo – PNUTS – CouchDB – MemcacheDB – Cassandra • Speed, Scalability, Availability, Throughput, No Complexity

  2. Eventual Consistency [Diagram: a manager and several replicas, including Replica A and Replica B, with Write Request and Read Request arrows]

  3. Eventual Consistency • Writes ordered after commit • Reads can be out-of-order or stale • Easy to scale, high throughput • Difficult application programming model

  4. Traditional Solution to Consistency • Two-Phase Commit: 1. Prepare 2. Vote: Yes 3. Commit 4. Ack [Diagram: a write request handled by a manager and several replicas]
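
The four steps on this slide (Prepare, Vote, Commit, Ack) can be sketched as follows. This is a minimal in-memory illustration, not the presenters' code: the `Replica` class, its `healthy` flag, and `two_phase_write` are hypothetical names, and failures are modeled only as a replica voting No.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A participant that votes in phase 1 and applies the write in phase 2."""
    name: str
    healthy: bool = True                        # assumption: an unhealthy replica votes No
    store: dict = field(default_factory=dict)   # committed values
    staged: dict = field(default_factory=dict)  # prepared but uncommitted values

    def prepare(self, key, value):
        # Steps 1-2: stage the write and return the vote.
        if not self.healthy:
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        # Steps 3-4: make the staged write visible and acknowledge it.
        self.store[key] = self.staged.pop(key)
        return True

    def abort(self, key):
        # Discard a staged write, if any.
        self.staged.pop(key, None)


def two_phase_write(replicas, key, value):
    """Coordinator role: commit only if every replica votes Yes."""
    votes = [r.prepare(key, value) for r in replicas]
    if all(votes):
        return all(r.commit(key) for r in replicas)
    for r in replicas:
        r.abort(key)
    return False
```

Every write waits on a round trip to every replica twice, which is the cost the next slide summarizes as "Expensive implementation".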

  5. Strong Consistency • Reads and Writes strictly ordered • Easy programming • Expensive implementation • Doesn’t scale well

  6. Our Goal • Easy programming • Easy to scale, high throughput

  7. Chain Replication (van Renesse & Schneider, OSDI 2004) [Diagram: a manager and a chain of replicas from HEAD to TAIL, with write requests W1, W2 and read requests R1, R2, R3]

  8. Chain Replication • Strong consistency • Simple replication • Increases write throughput • Low read throughput • Can we increase throughput? • Insight: – Most applications are read-heavy (100:1)
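
A minimal sketch of basic chain replication, assuming an in-memory chain with no failures: writes enter at the head and are applied node by node until they reach the tail, and every read is served by the tail. The names `ChainNode`, `build_chain`, `write`, and `read` are hypothetical, and the `reads_served` counter exists only to show where read load lands.

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None        # successor in the chain; None at the tail
        self.reads_served = 0   # illustration only: counts reads handled here


def build_chain(names):
    """Link nodes in order and return (head, tail)."""
    nodes = [ChainNode(n) for n in names]
    for node, successor in zip(nodes, nodes[1:]):
        node.next = successor
    return nodes[0], nodes[-1]


def write(head, key, value):
    """Apply the write at the head, then at each successor down to the tail."""
    node = head
    while node is not None:
        node.store[key] = value
        node = node.next
    return True  # acknowledged once the tail has applied the write


def read(tail, key):
    """All reads go to the tail, which only ever holds committed values."""
    tail.reads_served += 1
    return tail.store.get(key)
```

Because only the tail answers reads, adding nodes to the chain does not add read capacity, which is the "Low read throughput" point above.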

  9. CRAQ • Two states per object – clean and dirty [Diagram: chain HEAD, Replica, Replica, Replica, TAIL, every node holding V1, with a read request at each node]

  10. CRAQ • Two states per object – clean and dirty • If latest version is clean, return value • If dirty, contact tail for latest version number [Diagram: chain HEAD, Replica, Replica, Replica, TAIL while a write request replaces V1 with V2; nodes hold V1, then V1,V2, then V2, and read requests are answered with V1 or V2]
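
The clean/dirty read rule can be sketched as below. This is an in-memory illustration under stated assumptions, not the authors' C++ implementation: `CraqNode`, `propagate`, `commit_at_tail`, and `ack_back` are hypothetical names, version numbers are supplied by the caller, and the key is assumed to have been written at least once before it is read.

```python
class CraqNode:
    def __init__(self, name):
        self.name = name
        self.versions = {}    # key -> {version number: value}
        self.committed = {}   # key -> newest version known to be committed

    def apply_write(self, key, version, value):
        # A new version is stored alongside the old one until it is acknowledged.
        self.versions.setdefault(key, {})[version] = value

    def mark_clean(self, key, version):
        # The version is committed; older versions can be dropped.
        self.committed[key] = version
        kept = {v: val for v, val in self.versions[key].items() if v >= version}
        self.versions[key] = kept

    def is_dirty(self, key):
        return max(self.versions[key]) > self.committed.get(key, 0)

    def read(self, key, tail):
        if not self.is_dirty(key):
            return self.versions[key][self.committed[key]]
        # Dirty: ask the tail for the committed version number, answer locally.
        return self.versions[key][tail.committed[key]]


def propagate(chain, key, version, value, upto=None):
    """Apply a write to the first `upto` nodes (all nodes when upto is None)."""
    for node in chain[:upto]:
        node.apply_write(key, version, value)


def commit_at_tail(chain, key, version):
    chain[-1].mark_clean(key, version)


def ack_back(chain, key, version):
    """The acknowledgment travels back up the chain, cleaning each node."""
    for node in reversed(chain[:-1]):
        node.mark_clean(key, version)
```

Splitting the write into `propagate`, `commit_at_tail`, and `ack_back` is only a device for freezing the chain mid-write, so that a read at a dirty node can be observed returning the old value before the tail commits and the new value afterwards.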

  11. Multicast Optimizations • Each chain forms group • Tail multicasts ACKs [Diagram: chain HEAD, Replica, Replica, Replica, TAIL, with nodes moving from V1,V2 to V2]

  12. Multicast Optimizations • Each chain forms group • Tail multicasts ACKs • Head multicasts write data [Diagram: chain HEAD, Replica, Replica, Replica, TAIL receiving a write request, with nodes holding V2 and V3]

  13. CRAQ Benefits • From Chain Replication – Strong consistency – Simple replication – Increases write throughput • Additional Contributions – Read throughput scales: Chain Replication with Apportioned Queries – Supports Eventual Consistency

  14. High Diversity • Many data storage systems assume locality – Well connected, low latency • Real large applications are geo-replicated – To provide low latency – Fault tolerance (source: Data Center Knowledge)

  15. Multi-Datacenter CRAQ [Diagram: a chain of replicas from HEAD to TAIL spanning datacenters DC1, DC2, and DC3]

  16. Multi-Datacenter CRAQ [Diagram: a chain of replicas from HEAD to TAIL spanning datacenters DC1, DC2, and DC3, with clients]

  17. Chain Configuration (Motivation → Solution)
   1. Popular vs. scarce objects → Specify chain size
   2. Subset relevance → List datacenters – dc1, dc2, …, dcN
   3. Datacenter diversity → Separate sizes – dc1, chain_size1, …
   4. Write locality → Specify master
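
One way to picture the four configuration options is as a single record with per-datacenter sizes and an optional master. This is a hypothetical sketch: `ChainConfig` and `same_size` are invented names, and the assumption that a listed set of datacenters shares one chain size is mine, not something the slide states.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChainConfig:
    sizes: Dict[str, int]          # datacenter name -> replicas placed there
    master: Optional[str] = None   # datacenter preferred for writes, if any

    def __post_init__(self):
        if any(n < 1 for n in self.sizes.values()):
            raise ValueError("each listed datacenter needs at least one replica")
        if self.master is not None and self.master not in self.sizes:
            raise ValueError("master must be one of the listed datacenters")

    @property
    def total(self):
        """Total chain length across all datacenters."""
        return sum(self.sizes.values())


def same_size(datacenters, chain_size, master=None):
    """Option 2 under my assumption: each listed datacenter gets chain_size nodes."""
    return ChainConfig({dc: chain_size for dc in datacenters}, master)
```

For example, `ChainConfig({"dc1": 3, "dc2": 1}, master="dc1")` would describe separate sizes (option 3) with a master datacenter (option 4).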

  18. Master Datacenter [Diagram: a writer in DC1 and a chain of replicas spanning DC1, DC2, and DC3, with HEAD and TAIL positions marked]

  19. Implementation • Approximately 3,000 lines of C++ • Uses Tame extensions to SFS asynchronous I/O and RPC libraries • Network operations use Sun RPC interfaces • Uses Yahoo’s ZooKeeper for coordination

  20. Coordination Using ZooKeeper • Stores chain metadata • Monitors/notifies about node membership [Diagram: CRAQ nodes and a ZooKeeper node in each of DC1, DC2, and DC3]

  21. Evaluation • Does CRAQ scale vs. CR? • How does write rate impact performance? • Can CRAQ recover from failures? • How does the WAN affect CRAQ? • Tests use Emulab network emulation testbed

  22. Read Throughput as Writes Increase [Plot: Reads/s (0 to 15000) vs. Writes/s (0 to 100) for CRAQ-7, CRAQ-3, and CR-3, with 1x, 3x, and 7x marks]

  23. Failure Recovery (Read Throughput) [Plot: Reads/s (0 to 60000) vs. Time (0 to 50 s) for chains of length 3, 5, and 7]

  24. Failure Recovery (Latency) [Plots: Read Latency (ms) vs. Time (s) and Write Latency (ms) vs. Time (s)]

  25. Geo-replicated Read Latency [Plot: Mean Latency (0 to 80 ms) vs. Writes/s (0 to 20) for CR and CRAQ]

  26. If Single Object Put/Get Insufficient • Test-and-Set, Append, Increment – Trivial to implement – Head alone can evaluate • Multiple object transaction in same chain – Can still be performed easily – Head alone can evaluate • Multiple chains – An agreement protocol (2PC) can be used – Only heads of chains need to participate – Although degrades performance (use carefully!)
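
A sketch of why single-object operations such as Test-and-Set, Append, and Increment can be evaluated by the head alone: every write passes through the head, so it holds the newest value of each object and can compute the result before forwarding an ordinary write. The `HeadNode` class and its method names are hypothetical, and the chain is modeled as a plain list of in-memory dicts.

```python
class HeadNode:
    def __init__(self, chain_length):
        # stores[0] is the head's copy; stores[-1] is the tail's copy.
        self.stores = [dict() for _ in range(chain_length)]

    def _write(self, key, value):
        # Once evaluated at the head, the result travels as a normal write.
        for store in self.stores:
            store[key] = value

    def test_and_set(self, key, expected, new):
        """Write `new` only if the head's current value equals `expected`."""
        if self.stores[0].get(key) != expected:
            return False
        self._write(key, new)
        return True

    def increment(self, key, delta=1):
        value = self.stores[0].get(key, 0) + delta
        self._write(key, value)
        return value

    def append(self, key, suffix):
        value = self.stores[0].get(key, "") + suffix
        self._write(key, value)
        return value
```

Operations that span several chains do not fit this pattern, which is why the slide falls back to an agreement protocol among the heads for that case.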

  27. Summary • CRAQ Contributions? – Challenges trade-off of consistency vs. throughput • Provides strong consistency • Throughput scales linearly for read-mostly • Support for wide-area deployments of chains • Provides atomic operations and transactions • Thank You. Questions?
