
Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems


  1. Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems. Guoxin Liu, Haiying Shen, and Haoyu Wang. Holcombe Department of Electrical and Computer Engineering, Clemson University. Presented by Haoyu Wang.

  2. Outline: 1. Introduction; 2. System Design; 3. Performance Evaluation; 4. Conclusion.

  3. Introduction. Background: the Clemson Palmetto clusters. The load balancing problem: prior approaches balance the I/O load of data storage. Why not also consider the computing workload?

  4. Introduction: Previous Work. Challenges for load balancing: data locality; task delay; long-term load balance; cost-efficiency and scalability. Related work: random data allocation; balancing the number of data blocks; balancing the I/O load.

  5. System Design: Main Contributions. 1. Trace analysis of computing workloads; 2. A computing load aware, long-view load balancing method; 3. Trace-driven experiments.

  6. System Design: Trace Data Analysis. [Figure: CDF of task running time (s), log scale 1 to 1000.]

  7. System Design: Trace Data Analysis. [Figures: CDF of task running time (s); CDF of number of currently submitted tasks.]

  8. System Design: Trace Data Analysis. [Figures: CDF of task running time (s); CDF of number of currently submitted tasks; CDF of number of currently submitted tasks from different jobs.]

  9. System Design: Trace Data Analysis. [Figures: CDF of task running time (s); CDF of number of currently submitted tasks; CDF of number of currently submitted tasks from different jobs; CDF of num. of data transmissions of a server.]

  10. System Design: Trace Data Analysis. [Figures: CDF of task running time (s); CDF of number of currently submitted tasks; CDF of number of currently submitted tasks from different jobs; CDF of num. of data transmissions of a server; CDF of waiting time of a task (s).]

  11. System Design: CALV System Overview. Coefficient-based data reallocation. Principle 1: data blocks that contribute more computing workload at more overloaded epochs, in both the spatial and temporal spaces, have higher priority to be selected for reallocation. Principle 2: among all data blocks contributing workload at an overloaded epoch, the blocks that contribute less workload at more underloaded epochs have higher priority to be selected for reallocation.
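The two principles above suggest a per-block scoring heuristic. Below is a minimal sketch of such a coefficient, assuming per-epoch load vectors; the function names, the linear weighting by the amount of over/underload, and the greedy selection loop are illustrative assumptions, not the paper's actual coefficient formula.

```python
# Sketch of coefficient-based block selection (hypothetical weighting):
# a block scores higher the more load it contributes at overloaded
# epochs (Principle 1), and lower the more it contributes at
# underloaded epochs (Principle 2).

def reallocation_priority(block_load, server_load, capacity):
    """block_load[e]: the block's computing load at epoch e.
    server_load[e]: the server's total load at epoch e.
    Returns a coefficient; higher means reallocate first."""
    score = 0.0
    for e, load in enumerate(block_load):
        imbalance = server_load[e] - capacity  # > 0 means overloaded epoch
        if imbalance > 0:
            score += load * imbalance          # Principle 1
        else:
            score -= load * (-imbalance)       # Principle 2
    return score

def select_blocks(blocks, server_load, capacity):
    """Greedily pick the highest-coefficient blocks until no epoch
    of the source server remains overloaded."""
    load = list(server_load)
    chosen = []
    ranked = sorted(blocks,
                    key=lambda b: reallocation_priority(blocks[b], load, capacity),
                    reverse=True)
    for name in ranked:
        if all(l <= capacity for l in load):
            break
        chosen.append(name)
        load = [l - x for l, x in zip(load, blocks[name])]
    return chosen
```

For example, with capacity 5 and server load [7, 4, 5] over three epochs, a block loading [4, 0, 0] concentrates its contribution on the one overloaded epoch and is selected first.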

  12. System Design: CALV System Overview. Coefficient-based data reallocation. [Figure: selection of data blocks to reallocate across servers S_i, S_j, S_k, with blocks d_1 to d_7 placed over epochs e_1 to e_3; a marked line shows the computing capacity of the server. (a) Reduce num. of reported data blocks in spatial space; (b) reduce num. of reported data blocks in temporal space; (c) avoid server underload.]

  13. System Design: CALV System Overview. Lazy Data Block Transmission. [Figure: lazy data block transmission between servers S_i and S_j, with blocks d_1 to d_5 over epochs e_1 to e_4; a marked line shows the computing capacity.]
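The lazy transmission idea above can be sketched as a scheduler that postpones each selected block's transfer instead of moving it at reallocation time. This is an assumed interpretation of the slide's figure, with illustrative names: here each block stays on its source server until the first epoch at which its presence would overload that server.

```python
# Hypothetical sketch of lazy data block transmission: defer each
# transfer to the latest epoch that still prevents the source server
# from exceeding its computing capacity.

def schedule_lazy_transfers(selected, block_load, server_load, capacity):
    """selected: names of blocks chosen for reallocation.
    block_load[name][e]: that block's load at epoch e.
    Returns {block: epoch} -- the epoch by which each block must
    leave so the source server never exceeds capacity."""
    schedule = {}
    load = list(server_load)
    for name in selected:
        deadline = len(load)  # default: the block may stay the whole window
        for e in range(len(load)):
            if load[e] > capacity:  # first epoch this block's presence overloads
                deadline = e
                break
        schedule[name] = deadline
        # from the deadline on, the block's load no longer counts here
        for e in range(deadline, len(load)):
            load[e] -= block_load[name][e]
    return schedule
```

Deferring transfers this way spreads reallocation traffic over time rather than concentrating it at the moment the balancing decision is made.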

  14. Performance Evaluation: Trace-driven Experiments. Simulated environment: 3000 servers in a typical fat-tree topology; 8 computing slots per server; epoch length set to 1 second. Comparison methods: Random, Sierra, Ursa, CA.

  15. Performance Evaluation: Trace-driven Experiments. Performance of data locality. [Figure: % of network load compared to Random, for Random, Sierra, Ursa, CA, and CALV, vs. x times of num. of jobs (0.5 to 1.5).]

  16. Performance Evaluation: Trace-driven Experiments. Performance of task latency. [Figure: reduced avg. latency per task (s) relative to Random (Random = 0), for Sierra, Ursa, CA, and CALV, vs. x times of num. of jobs (0.5 to 1.5).]

  17. Performance Evaluation: Trace-driven Experiments. Performance of cost-efficiency. [Figure: num. of reported blocks for CALV, CALV-MAX, CALV-Random, and CALV-All, vs. x times of num. of jobs.] Performance of lazy data transmission. [Figure: saved % of network load, saved % of peak num. of reallocated blocks, and reduced num. of overloads (*20), log scale, vs. x times of num. of jobs.]

  18. Conclusion. Computing workloads matter for load balancing in cluster storage systems. CALV is cost-efficient and achieves long-term load balance.

  19. The End Thanks! Questions?
