Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu, Haiying Shen and Haoyu Wang Holcombe Department of Electrical and Computer Engineering Clemson University Presented by Haoyu Wang
Outline
1. Introduction
2. System design
3. Performance evaluation
4. Conclusion
Introduction
Background (Clemson Palmetto clusters)
The load balancing problem: I/O load and data storage
Why not also consider the computing workload?
Introduction: Previous Work
Challenges for load balancing:
– Data locality
– Task delay
– Long-term load balance
– Cost-efficiency & scalability
Related work:
– Random data allocation
– Balancing the number of data blocks
– Balancing the I/O load
System Design
Main contributions:
1. Trace analysis of computing workloads
2. Computing load aware, long-view load balancing method
3. Trace-driven experiments
System Design: Trace Data Analysis
[Figures: CDFs from the trace of (1) task running time (s), (2) number of currently submitted tasks, (3) number of currently submitted tasks from different jobs, (4) number of data transmissions of a server, and (5) waiting time of a task (s).]
System Design: CALV System Overview
Coefficient-based data reallocation
Principle 1: Data blocks that contribute more computing workload at more overloaded epochs, in both the spatial and temporal dimensions, have a higher priority to be selected for reallocation.
Principle 2: Among all data blocks contributing workload at an overloaded epoch, those that contribute less workload at more underloaded epochs have a higher priority to be selected for reallocation.
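The two principles above can be sketched as a per-block score (a minimal illustration only; the scoring function and names here are assumptions, not the paper's exact coefficient formula):

```python
def reallocation_priority(block_load, server_load, capacity):
    """Hypothetical scoring of a data block for reallocation.

    block_load[e]: computing load the block contributes at epoch e.
    server_load[e]: total load on the hosting server at epoch e.
    capacity: the server's computing capacity.
    Returns a score; a higher score means the block is selected earlier.
    """
    score = 0.0
    for e, load in enumerate(block_load):
        overload = server_load[e] - capacity
        if overload > 0:
            # Principle 1: more contribution at more overloaded epochs
            # raises the block's reallocation priority.
            score += load * overload
        else:
            # Principle 2: contribution at underloaded epochs lowers the
            # priority, keeping useful blocks on the server.
            score -= load * (-overload)
    return score
```

For example, with capacity 10 and server loads [15, 5], a block loading only the overloaded epoch outranks one loading only the underloaded epoch.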
System Design: CALV System Overview
Coefficient-based data reallocation: selection of data blocks to reallocate
[Figure: data blocks d1–d7 placed across epochs e1–e3 on servers Si, Sj, and Sk, shown against each server's computing capacity. (a) Reduce the number of reported data blocks in the spatial space; (b) reduce the number of reported data blocks in the temporal space; (c) avoid server underload.]
System Design: CALV System Overview
Lazy data block transmission
[Figure: data blocks d1–d5 across epochs e1–e4 on servers Si and Sj, shown against each server's computing capacity, illustrating lazy data block transmission.]
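The idea of lazy transmission can be sketched as deferring each scheduled block transfer until just before the destination first needs the block, rather than moving all blocks at reallocation time (an assumed simplification; the function and epoch convention below are illustrative, not the paper's exact scheme):

```python
def schedule_lazy_transfers(transfers, needed_at):
    """Hypothetical lazy transfer scheduler.

    transfers: list of block ids scheduled to move to a new server.
    needed_at[b]: first epoch at which the destination serves load for b.
    Each block is transmitted one epoch before it is first needed
    (or immediately if needed at epoch 0), spreading transfers over
    time and lowering the peak number of blocks in flight.
    """
    return {b: max(0, needed_at[b] - 1) for b in transfers}
```

A block needed at epoch 3 is thus moved at epoch 2, while a block needed immediately is moved right away.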
Performance Evaluation: Trace-Driven Experiments
Simulated environment: 3000 servers in a typical fat-tree topology, with 8 computing slots per server; epoch length set to 1 second.
Comparison methods: Random, Sierra, Ursa, CA
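In a setup like this, overload is naturally measured per (server, epoch) pair. A minimal sketch of that bookkeeping (the function name and data layout are assumptions for illustration):

```python
def count_overloads(loads_per_epoch, slots):
    """Count overloaded (server, epoch) pairs in a simulated cluster.

    loads_per_epoch[s][e]: number of concurrent tasks on server s at
    epoch e. A server is overloaded at an epoch when its task demand
    exceeds its computing slots (8 per server in the experiments).
    """
    return sum(
        1
        for server_loads in loads_per_epoch
        for load in server_loads
        if load > slots
    )
```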
Performance Evaluation: Trace-Driven Experiments
Performance of data locality
[Figure: % of network load compared to Random (y-axis) vs. x times the number of jobs (0.5–1.5), for Random, Sierra, Ursa, CA, and CALV.]
Performance Evaluation: Trace-Driven Experiments
Performance of task latency
[Figure: reduced average latency per task (s) relative to Random (Random = 0) vs. x times the number of jobs (0.5–1.5), for Sierra, Ursa, CA, and CALV.]
Performance Evaluation: Trace-Driven Experiments
Performance of cost-efficiency
[Figure: number of reported blocks vs. x times the number of jobs (0.5–1.5), for CALV, CALV-MAX, CALV-Random, and CALV-All.]
Performance of lazy data transmission
[Figure: saved % of network load, saved % of peak number of reallocated blocks, and reduced number of overloads (×20), vs. x times the number of jobs (0.5–1.5).]
Conclusion
The importance of considering computing workloads in load balancing
CALV is cost-efficient and achieves long-term load balance
The End Thanks! Questions?