Fast and Accurate Load Balancing for Geo-Distributed Storage Systems
Kirill L. Bogdanov (1), Waleed Reda (1,2), Gerald Q. Maguire Jr. (1), Dejan Kostic (1), Marco Canini (3)
(1) KTH Royal Institute of Technology, (2) Université Catholique de Louvain, (3) KAUST
Geo-Distributed Services
Service Level Objective (SLO): request completion time at the target percentile (e.g., 30 ms at the 95th percentile)
[Figure: clients sending requests to a datacenter]
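A percentile SLO such as the one above can be checked against a batch of measured completion times. The following is a minimal sketch, assuming a nearest-rank percentile definition; the function names and the sample values are illustrative and not taken from Kurma.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def meets_slo(completion_times_ms, target_ms=30.0, p=95):
    """True if the p-th percentile completion time is within the target."""
    return percentile(completion_times_ms, p) <= target_ms

def violation_fraction(completion_times_ms, target_ms=30.0):
    """Fraction of requests slower than the target."""
    slow = sum(1 for t in completion_times_ms if t > target_ms)
    return slow / len(completion_times_ms)

# Hypothetical measurements: 95 fast requests and 5 slow ones.
times = [10.0] * 95 + [50.0] * 5
print(meets_slo(times), violation_fraction(times))
```

With a 95th-percentile target, up to 5% of requests may exceed 30 ms before the SLO is considered violated.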
Geo-Distributed Services
Web-based services demonstrate temporal and spatial variability in load.
Problem: it is difficult to meet strict SLOs while maintaining high resource utilization and low cost.
[Figure: clients and a datacenter]
Approach 1 - Datacenter Elasticity
[Figure: arrival rate (1000x req/s) over time]
Approach 1 - Datacenter Elasticity
[Figure: arrival rate (1000x req/s) over time, annotated with provisioning delays, unused capacity, and SLO violations]
Provisioning delays lead to overprovisioning!
- Provisioning delay (minutes) due to the time needed to spawn and warm up a VM
- Hard to predict workload far into the future
- Load spikes can be short-lived
Approach 2 - Geo-Distributed Load Balancing
[Figure: two arrival-rate plots (1000x req/s) showing redirection, annotated with redirection delay and SLO violations]
- Excessive redirection delays
- Inaccurate response time estimation
- How much to redirect? Excessive or insufficient redirection
Our Approach: Kurma
[Figure: two arrival-rate plots (1000x req/s)]
- Reacts to changes in load within seconds
- Avoids unnecessary scaling out
- Accurately estimates the remote rate of SLO violations
- Tames SLO violations at the target level
Request Completion Time
[Figure: Datacenter Frankfurt and Datacenter Ireland, each with a server (Server 1, Server 2), connected by a Wide Area Network]
- Base Propagation: stable component associated with packet propagation along a network path
- Delay Variance: variable component associated with competing traffic and queuing
- Service Time: variable component associated with load on the server
Kurma solves a global optimization model while considering Base Propagation + Delay Variance + Service Time at all datacenters.
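The three-way decomposition can be written down directly. This is only an illustrative sketch: the class name, field names, and the millisecond values are assumptions chosen for the example, not measurements from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestTiming:
    base_propagation_ms: float  # stable: packet propagation along the network path
    delay_variance_ms: float    # variable: competing traffic and queuing
    service_time_ms: float      # variable: load on the server

    @property
    def completion_ms(self) -> float:
        """End-to-end completion time as the sum of the three components."""
        return (self.base_propagation_ms
                + self.delay_variance_ms
                + self.service_time_ms)

def violates_slo(timing: RequestTiming, slo_ms: float = 30.0) -> bool:
    """True if the request finishes later than the SLO target."""
    return timing.completion_ms > slo_ms

# Hypothetical remote request: the same server work, two different network conditions.
calm = RequestTiming(base_propagation_ms=20.0, delay_variance_ms=1.5, service_time_ms=6.0)
congested = RequestTiming(base_propagation_ms=20.0, delay_variance_ms=5.0, service_time_ms=6.0)
print(violates_slo(calm), violates_slo(congested))
```

The example shows why the variable components matter: identical service time and base propagation can land on either side of the SLO depending on the delay variance.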
Understanding Service Time
[Figure: Datacenter Frankfurt, 5-server Cassandra cluster]
Understanding Service Time
[Figure: Datacenter Frankfurt, 5-server Cassandra cluster]
Challenge: how to accurately estimate the remote fraction of SLO violations at runtime under variable network conditions?
Understanding Service Time
[Figure: Datacenter Frankfurt and Datacenter Ireland, each a 5-server Cassandra cluster, connected by a Wide Area Network]
Insight: the farther away a remote datacenter is, the less loaded it should be to serve remote requests within a given SLO target.
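The insight can be illustrated with a simple latency budget: whatever the network round trip consumes is no longer available for service time. The sketch below assumes a fixed round-trip time (ignoring delay variance) and a hypothetical load-to-p95-service-time profile; neither the profile values nor the function name come from the talk.

```python
def max_admissible_load(p95_service_ms_by_load, rtt_ms, slo_ms=30.0):
    """Highest profiled load (req/s) whose p95 service time fits in slo_ms - rtt_ms.

    Returns None when even the lightest profiled load does not fit.
    """
    budget_ms = slo_ms - rtt_ms
    feasible = [load for load, p95 in p95_service_ms_by_load.items()
                if p95 <= budget_ms]
    return max(feasible) if feasible else None

# Hypothetical profile: server load (req/s) -> p95 service time (ms).
PROFILE = {1000: 4.0, 3000: 6.0, 5000: 10.0, 7000: 18.0, 9000: 35.0}

for rtt in (0.0, 12.0, 22.0, 28.0):
    print(rtt, max_admissible_load(PROFILE, rtt))
```

As the round-trip time grows, the admissible load at the remote datacenter shrinks, and beyond some distance no load level can serve remote requests within the target.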
Understanding WAN Latency
[Figure: base propagation delay and the service time distribution recorded locally at a specific load feed Monte Carlo simulations]
Understanding WAN Latency
[Figure: base propagation delay and the service time distribution recorded locally at a specific load feed Monte Carlo simulations]
Gives the SLO violation rate at a given load and WAN conditions.
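A Monte Carlo estimate of this kind can be sketched by repeatedly sampling a locally recorded service time, adding the WAN delay, and counting how often the sum exceeds the SLO. The sketch below is an assumption-laden illustration: it models delay variance as an exponential random variable with mean `jitter_ms`, which is a simplification chosen for the example and not the model described in the talk.

```python
import random

def estimate_violation_rate(service_times_ms, base_rtt_ms, jitter_ms,
                            slo_ms=30.0, trials=20000, seed=0):
    """Estimate the fraction of remote requests that would miss the SLO.

    service_times_ms: samples recorded locally at one specific load level.
    base_rtt_ms:      stable base propagation delay to the remote datacenter.
    jitter_ms:        mean of the (assumed exponential) delay variance; 0 disables it.
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        service = rng.choice(service_times_ms)
        jitter = rng.expovariate(1.0 / jitter_ms) if jitter_ms > 0 else 0.0
        if base_rtt_ms + jitter + service > slo_ms:
            violations += 1
    return violations / trials

# Hypothetical local recording: 90% of requests take 5 ms, 10% take 40 ms.
recorded = [5.0] * 90 + [40.0] * 10
for rtt in (0.0, 10.0, 20.0):
    print(rtt, estimate_violation_rate(recorded, rtt, jitter_ms=2.0))
```

Running the estimate once per recorded load level and per remote datacenter would give a table of violation rates indexed by load and WAN conditions.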
Understanding WAN Latency
[Figure: base propagation delay, estimation error]
Incorporating WAN and Load
Optimisation Model
[Figure] + runtime load in each datacenter {λ1, λ2, λ3}
Optimisation Problem
✓ Minimize global SLO violations (KurmaPerf)
✓ Minimize the cost of running a service (KurmaCost)
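To make the KurmaPerf-style objective concrete, the toy sketch below searches for the fraction of one datacenter's load to redirect to another so that the total rate of SLO-violating requests is minimized. It is a brute-force grid search over two datacenters, swapped in for illustration; it is not Kurma's actual solver, and the violation-rate curves `local_rate` and `remote_rate` are hypothetical.

```python
def local_rate(load):
    """Hypothetical violation fraction for locally served requests at a given load."""
    return min(1.0, max(0.0, (load - 5000) / 5000))

def remote_rate(load):
    """Hypothetical violation fraction for remotely served requests.

    Remote requests have less slack, so violations start at a lower load.
    """
    return min(1.0, max(0.0, (load - 3000) / 5000))

def best_redirect_fraction(load_a, load_b, steps=100):
    """Grid search over the fraction f of A's load that is redirected to B.

    Returns (f, violating requests per second) for the best f found.
    """
    best_f, best_cost = 0.0, float("inf")
    for i in range(steps + 1):
        f = i / steps
        kept, moved = load_a * (1 - f), load_a * f
        load_at_b = load_b + moved
        cost = (kept * local_rate(kept)             # A's requests served at A
                + moved * remote_rate(load_at_b)    # A's requests served at B
                + load_b * local_rate(load_at_b))   # B's own requests
        if cost < best_cost:
            best_f, best_cost = f, cost
    return best_f, best_cost

print(best_redirect_fraction(8000, 1000))
```

A cost-oriented variant in the spirit of KurmaCost would instead constrain the violation rate to the target and minimize redirection and provisioning cost; that variant is not shown here.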
Implementation
Each epoch (2.5 sec → 0.4 Hz):
- Perform run-time WAN latency measurements
- Aggregate load information (rates of requests)
- Exchange metrics to obtain a global view (latencies + loads)
- Solve the decentralized performance model
- Enforce computed rates of request redirection
[Figure: Datacenter Stockholm, Datacenter London, Datacenter Frankfurt]
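The final step of each epoch, enforcing the computed redirection rates, could be realized by choosing a destination per request with probability equal to its computed share. This is a minimal sketch under that assumption; the function name, the dictionary layout, and the example fractions are hypothetical, and the talk does not specify the enforcement mechanism.

```python
import random

EPOCH_SECONDS = 2.5  # epoch length from the slide (0.4 Hz)

def pick_datacenter(redirect_fractions, rng):
    """Pick a destination datacenter for one request.

    redirect_fractions: {datacenter name: fraction of requests}, fractions sum to 1.
    rng:                a random.Random instance.
    """
    names = list(redirect_fractions)
    weights = [redirect_fractions[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical rates computed for requests arriving at Frankfurt in the current epoch.
fractions = {"Frankfurt": 0.8, "London": 0.2}
rng = random.Random(1)
sent = [pick_datacenter(fractions, rng) for _ in range(10000)]
print(sent.count("London") / len(sent))
```

Because the choice is made independently per request, the observed split converges to the computed fractions over an epoch without any per-request coordination between datacenters.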
Evaluation Setup
Geo-distributed Cassandra cluster
• 3 Amazon EC2 datacenters (Ireland, Frankfurt, London)
• 5 x r5.large VMs per datacenter
• SLO: 30 ms at the 95th percentile
• Modified YCSB to replay workload traces (World Cup http://ita.ee.lbl.gov/html/contrib/WorldCup.html)
Experiments:
• Minimizing SLO violations for reads
• Maintaining target SLO (accuracy)
• Cost savings for 1 min billing intervals (simulations)
• Reads and writes, scalability, etc. link here.
Workload Trace
[Figure: workload trace with no elastic scaling; the load threshold for 5% SLO violations is marked]
Cumulative Normalized SLO Violations
Kurma’s SLO violations are at 2.4%.
[Figure: bar chart; the numbers above the bars indicate the amount of inter-datacenter traffic transferred, and whiskers show the 75th percentile]
Average Provisioning Cost Over 30 Consecutive Days
[Figure: total cost (US$) per day for All Shared, KurmaCost, KurmaPerf, and All local]
All Shared: WAN latency = 0 ms, bandwidth cost = $0
KurmaCost: keeps SLO violations under 5% (minimize redirections while avoiding scaling out)
KurmaPerf: minimize SLO violations (no consideration for traffic usage)
- Reactive threshold-based elastic controller
- Minimum billing period of 1 minute
- Results obtained using simulations
Taming SLO Violations
[Figure: panels labelled "Under Elastic Threshold" and "No elastic scaling"]
Conclusion
- Kurma: a fast and accurate load balancer for geo-distributed systems that takes advantage of spatial variability in load
- Decouples end-to-end response time into components of base propagation latency, network congestion, and service time distribution
- By operating at the granularity of a few seconds, Kurma reduces SLO violations or lowers the cost of running services by avoiding excessive global service overprovisioning
Contact: KIRILLB@kth.se