Surviving congestion in geo-distributed storage systems
Brian Cho, University of Illinois at Urbana-Champaign
Marcos K. Aguilera, Microsoft Research Silicon Valley
Geo-distributed data centers
• Web applications increasingly deployed across geo-distributed data centers
  – e.g., social networks, online stores, messaging
• App data replicated across data centers
  – Disaster tolerance
  – Access locality
Congestion between geo-distributed data centers
• Limited bandwidth between data centers
  – e.g., leased lines, MPLS VPN
  – Bandwidth is expensive: ~$1K/Mbps [SprintMPLS]
  – Provisioned for typical (not peak) usage
• Many machines in each data center
Congestion → delay between geo-distributed data centers
• Congestion can cause significant delays
  – TCP message latency grows to order of seconds (see figure)
  – Observed across Amazon EC2 data centers [Kraska et al.]
• Users do not tolerate delays beyond ~1s [Nielsen]
[Figure: RPC round-trip delay under congestion reaches 10-30s]
Replication techniques applied to geo-distributed data centers
• Weak consistency
  – e.g., Amazon Dynamo, Yahoo PNUTS, COPS
  – Good performance: updates can be propagated asynchronously
  – Semantics undesirable in some cases (e.g., writes get re-ordered across replicas)
• Strong consistency
  – e.g., ABD, Paxos; available in Google Megastore, Amazon SimpleDB
  – Avoids the many problems of weak consistency
  – Must wait for updates to propagate across data centers
  – App delay requirements difficult to meet under congestion
Contributions
• Vivace: a strongly consistent key-value store that is resilient to congestion across geo-distributed data centers
• Approach
  – New algorithms send a small amount of critical information across data centers in separate prioritized messages
• Challenges
  – Still provide strong consistency
  – Keep prioritized messages small
  – Avoid delay overhead in the absence of congestion
Vivace algorithms
• Enhance previous strongly consistent algorithms
• Prioritize a small amount of critical information across sites
Two algorithms:
1. Read/write algorithm
   – Very simple
   – Based on the traditional quorum algorithm [ABD]
   – Linearizable read() and write()
   – read() contains a write-back phase
2. State machine replication algorithm
   – More complex; details in paper
Traditional quorum algorithm: write
1. Client sends <WRITE, key, val, ts> to Replicas 1-3; val is large (compared with key & ts)
2. Replicas reply <ACK-WRITE>; after a quorum of ACKs, the write is done
Traditional quorum algorithm: read
1. Client sends <READ, key> to Replicas 1-3; each replies <ACK-READ, val, ts>, carrying the large val
2. Write-back: client re-sends <WRITE, key, val, ts> (the large val, again!) and waits for <ACK-WRITE> from a quorum
   – The write-back ensures strong consistency (linearizability)
• The read is then done
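To make the message flow above concrete, here is a minimal in-memory Python sketch of the traditional quorum (ABD-style) read/write algorithm. The Replica class, the direct method calls standing in for network RPCs, and the fixed timestamps are illustrative assumptions, not the paper's implementation; in a real deployment these messages cross wide-area links to replicas in other data centers.

# Illustrative sketch only: direct calls stand in for RPCs, and contacting
# exactly a quorum simulates "send to all, wait for a majority of replies".
class Replica:
    def __init__(self):
        self.store = {}                          # key -> (ts, val)

    def write(self, key, ts, val):
        cur_ts, _ = self.store.get(key, (-1, None))
        if ts > cur_ts:                          # keep only the freshest (ts, val)
            self.store[key] = (ts, val)
        return "ACK-WRITE"

    def read(self, key):
        return self.store.get(key, (-1, None))   # reply <ACK-READ, val, ts>

REPLICAS = [Replica() for _ in range(3)]
MAJORITY = len(REPLICAS) // 2 + 1                # quorum = 2 of 3

def quorum_write(key, val, ts):
    # <WRITE, key, val, ts> to a quorum; note the large val crosses the wide area.
    acks = [r.write(key, ts, val) for r in REPLICAS[:MAJORITY]]
    return all(a == "ACK-WRITE" for a in acks)

def quorum_read(key):
    # Phase 1: <READ, key>; collect (ts, val) from a quorum, pick the largest ts.
    replies = [r.read(key) for r in REPLICAS[:MAJORITY]]
    ts, val = max(replies, key=lambda r: r[0])
    # Phase 2 (write-back): re-write the chosen value so no later read can
    # return an older one -- this is what gives linearizability.
    quorum_write(key, val, ts)
    return val

quorum_write("user:42", "hello", ts=1)
print(quorum_read("user:42"))                    # -> hello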
Vivace: write
• A new quorum of local replicas (Local Replica 1-3) is used alongside the remote Replicas 1-3
1. Client sends <W-LOCAL, key, val, ts> to the local replicas (val is sent only locally); they reply <ACK-W-LOCAL>
2. Client sends <W-TS, key, ts> to Replicas 1-3; with no val, the small message can be prioritized; they reply <ACK-W-TS>
   – The write is now done: Replicas 1-3 have a consistent view of key & ts, but no val (yet)
*  Client then sends <W-REMOTE, key, val, ts> to Replicas 1-3, adding val to their consistent view of key & ts; val is still large, but no longer on the critical path
Write comparison
• Traditional algorithm: 1 remote RTT
• Vivace algorithm: 1 prioritized remote RTT + 1 local RTT
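Below is a minimal in-memory Python sketch of the Vivace write path shown above. The message names (W-LOCAL, W-TS, W-REMOTE) follow the slides, but the VivaceReplica class, the LOCAL/REMOTE groups, and the direct calls standing in for (prioritized) RPCs are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only: direct calls stand in for RPCs; in the real system
# W-TS travels in a small, prioritized wide-area message.
class VivaceReplica:
    def __init__(self):
        self.ts  = {}                            # key -> latest timestamp seen
        self.val = {}                            # key -> value (may lag behind ts)

    def w_local(self, key, val, ts):             # full value, local site only
        if ts >= self.ts.get(key, -1):
            self.ts[key], self.val[key] = ts, val
        return "ACK-W-LOCAL"

    def w_ts(self, key, ts):                     # tiny message, sent prioritized
        if ts > self.ts.get(key, -1):
            self.ts[key] = ts                    # val arrives later via W-REMOTE
        return "ACK-W-TS"

    def w_remote(self, key, val, ts):            # large val, off the critical path
        if ts >= self.ts.get(key, -1):
            self.ts[key], self.val[key] = ts, val

LOCAL  = [VivaceReplica() for _ in range(3)]     # new quorum of local replicas
REMOTE = [VivaceReplica() for _ in range(3)]     # replicas in other data centers
quorum = lambda group: group[: len(group) // 2 + 1]

def vivace_write(key, val, ts):
    # 1. <W-LOCAL, key, val, ts>: store the large val at a local quorum (1 local RTT).
    for r in quorum(LOCAL):
        r.w_local(key, val, ts)
    # 2. <W-TS, key, ts>: only key & ts cross the wide area, in a small
    #    prioritized message (1 prioritized remote RTT). The write is done here.
    for r in quorum(REMOTE):
        r.w_ts(key, ts)
    # *  <W-REMOTE, key, val, ts>: ship the large val in the background.
    for r in REMOTE:
        r.w_remote(key, val, ts)

vivace_write("user:42", "hello", ts=1)           # sample call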
Vivace: read
1. Client sends <R-TS, key> to Replicas 1-3, asking only for ts; the small message is prioritized; replicas reply <ACK-R-TS, ts>
2. Client sends <R-DATA, key, ts>, asking for the data with the largest ts; a replica replies <ACK-R-DATA, val>; val is large, but the client waits for only one reply (common case: a local replica)
3. Write-back: client sends <W-TS, key, ts>, writing back only the small ts, so the message is prioritized; replicas reply <ACK-W-TS>; the read is then done
Read comparison
• Traditional algorithm: 2 remote RTTs
• Vivace algorithm: 2 prioritized remote RTTs + 1 local RTT
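And a matching in-memory Python sketch of the Vivace read path, again using the slides' message names (R-TS, R-DATA, W-TS); the class and helper names are illustrative assumptions, not the paper's implementation. R-TS and the W-TS write-back are small prioritized messages, while the single R-DATA reply carries the large val and in the common case comes from a nearby replica.

# Illustrative sketch only: direct calls stand in for (prioritized) RPCs.
class VivaceReplica:
    def __init__(self):
        self.store = {}                          # key -> (ts, val)

    def r_ts(self, key):                         # tiny prioritized request: ts only
        return self.store.get(key, (-1, None))[0]

    def r_data(self, key, ts):                   # return val only if fresh enough
        cur_ts, val = self.store.get(key, (-1, None))
        return val if cur_ts >= ts else None

    def w_ts(self, key, ts):                     # tiny prioritized write-back
        cur_ts, val = self.store.get(key, (-1, None))
        if ts > cur_ts:
            self.store[key] = (ts, val)
        return "ACK-W-TS"

REPLICAS = [VivaceReplica() for _ in range(3)]
MAJORITY = len(REPLICAS) // 2 + 1

def vivace_read(key):
    # 1. <R-TS, key> (prioritized): get timestamps from a quorum, keep the largest.
    ts = max(r.r_ts(key) for r in REPLICAS[:MAJORITY])
    # 2. <R-DATA, key, ts>: ask for the matching value, but wait for only one
    #    reply -- in the common case a local replica already has it.
    val = None
    for r in REPLICAS:
        v = r.r_data(key, ts)
        if v is not None:
            val = v
            break
    # 3. <W-TS, key, ts> (prioritized): write back only the small ts, which is
    #    enough to ensure linearizability for later reads.
    for r in REPLICAS[:MAJORITY]:
        r.w_ts(key, ts)
    return val

REPLICAS[0].store["user:42"] = (1, "hello")      # pretend one replica has the value
print(vivace_read("user:42"))                    # -> hello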
Evaluation topics
• Practical prioritization setup
• Delay with congestion
  – KV-store operations
  – Twitter-clone web app operations
• Delay without congestion
  – Overhead of Vivace algorithms compared to traditional algorithms
Evaluation setup
• Local cluster (Illinois) ↔ Amazon EC2 (Ireland)
• DSCP-bit prioritization on the local router's egress port (prioritization applied here only)
• Congestion generated with iperf
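As a concrete illustration of this setup, the sketch below shows how a sender on Linux could mark its small critical messages with a DSCP codepoint so that the local router's egress queueing can prioritize them. The DSCP value, port, and address are assumptions for illustration; the router-side queue configuration is not shown and is not part of this snippet.

import socket

DSCP_EF = 46                        # "Expedited Forwarding" codepoint (example choice)
TOS_BYTE = DSCP_EF << 2             # DSCP occupies the upper 6 bits of the TOS byte

def prioritized_udp_socket():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_BYTE)   # mark outgoing packets
    return s

# Example: send a small <W-TS, key, ts> message with the priority marking.
# 127.0.0.1:9000 is a placeholder for a remote replica's address.
sock = prioritized_udp_socket()
sock.sendto(b"W-TS|user:42|17", ("127.0.0.1", 9000))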
Evaluation: does prioritization work in practice?
• Simple ping experiment
• Prioritized messages bypass congestion
• Local router-based prioritization is effective
Evaluation: how well does Vivace perform under congestion?
• KV-store operations: (a) read algorithms, (b) write algorithms, (c) state machine algorithms
• Twitter-clone operations: (a) post tweet, (b) read user timeline, (c) read friends timeline
Evaluation: how well does Vivace perform under congestion?
[Figure (a), read algorithms: under congestion, the traditional read's 2 remote RTTs suffer buffering delay and TCP resends on packet loss, while Vivace's 2 prioritized remote RTTs + 1 local RTT avoid the congestion delays]
Evaluation: what is the overhead of Vivace without congestion?
• (Results in paper)
• No measurable overhead compared to traditional algorithms
• The extra message phases are not harmful
Conclusion
• Proposed two new algorithms
  – Read/write (simple, in talk)
  – State machine (more complex, in paper)
• Both algorithms avoid delay due to congestion by prioritizing a small amount of critical information, while
  – Still providing strong consistency
  – Keeping prioritized messages small
  – Avoiding delay overhead in the absence of congestion
  – Using a practical prioritization infrastructure
• Careful use of prioritized messages can be an effective strategy in geo-distributed data centers