
Evaluation of a Failure Prediction Model for Large Scale Cloud Applications



  1. Evaluation of a Failure Prediction Model for Large Scale Cloud Applications. Mohammad S. Jassas and Qusay H. Mahmoud. Presentation at Canadian AI 2020.

  2. Introduction
     ▪ Cloud services → complexity of cloud architectures.
     ▪ Cloud applications have a high probability of failure.
     ▪ Most cloud providers have experienced a failure in one of their services.
     ▪ AWS experienced a failure in Elastic Block Store (EBS) [7].
     ▪ Many organizations are planning to use public cloud environments.
     ▪ Cloud providers → maintain their services to provide cloud consumers with a high level of QoS.

     [7] Marshall, P., Keahey, K., Freeman, T.: Elastic site: Using clouds to elastically extend site resources. In: 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).

  3. Problem Statement
     • Providing 24x7 service uptime has become one of the most significant challenges facing cloud providers.
     • Failed jobs consume a notable amount of computational resources and memory.

  4. Objective (diagram; labels as recovered from the slide)
     ▪ High reliability + availability
     ▪ Resource wastage
     ▪ Decrease the number of failed tasks
     ▪ Increase the performance of cloud apps
     ▪ Minimize time + cost

  5. Related Work
     ▪ Failure analysis and characterization have been studied widely in grid computing, cloud clusters, and supercomputers [1].
     ▪ The Google traces [3] have been used in various research studies, including workload characterization [5] and the application of statistical methods.
     ▪ In [2], we studied workload features such as memory usage, CPU speed, and disk space.
     ▪ Limited research has been done on failure prediction [4,5,6].
     ▪ El-Sayed et al. [4] designed a job failure prediction model using a random forest (RF) classifier.

     [1] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
     [2] Jassas, M., Mahmoud, Q.H.: Failure analysis and characterization of scheduling jobs in Google cluster trace. IEEE (2018)
     [3] Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Google Inc. (2011)
     [4] El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: a trace-driven approach to understanding, predicting, and mitigating job terminations. IEEE (2017)
     [5] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
     [6] Rosà, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM (2015)

  6. Proposed Solution

  7. Experiments and Evaluation Results
     ▪ Trace Description (Google and LANL)
     ▪ Experimental Setup
       ▪ scikit-learn → ML packages in Python
       ▪ Microsoft Azure → the Google trace has large volumes of data, requiring HPC nodes for analysis and prediction.

     Trace sources: google/cluster-data: Borg cluster traces from Google (GitHub); The Atlas Cluster Trace Repository (USENIX)

  8. Experiments and Evaluation Results
     ▪ Classifiers and Prediction Techniques
     Fig. 5. Performance evaluation of different algorithms applied to the Google trace

  9. Experiments and Evaluation Results
     Fig. 6. Performance evaluation of different algorithms applied to the Mustang and Trinity traces
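
The comparison summarized in Figs. 5 and 6 can be approximated with scikit-learn, the toolkit named in the experimental setup. The following is only a minimal sketch under stated assumptions: the data is synthetic (generated with `make_classification`, not taken from the Google or LANL traces), and the three classifiers, the metrics, and the helper name `evaluate` are illustrative choices, since the slides do not list which algorithms or scores were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for job records (assumption): label 1 = failed job, 0 = finished job.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

# Illustrative set of classifiers; the deck does not confirm this selection.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model on the training split and score it on the held-out split."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        }
    return results


results = evaluate(models, X_train, y_train, X_test, y_test)
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Reporting precision and recall alongside accuracy matters here because failed jobs are usually the minority class, so accuracy alone can look good for a model that rarely predicts a failure.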

  10. Experiments and Evaluation Results
     ▪ Feature Selection Algorithms
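
The slide does not name the feature selection algorithms that were evaluated. As one possible illustration, the sketch below uses scikit-learn's univariate `SelectKBest` with the ANOVA F-score (`f_classif`) on synthetic data. The feature names are hypothetical placeholders, not column names from the Google or LANL traces, and the choice of `k=3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical feature names, for illustration only.
feature_names = [
    "cpu_request", "memory_request", "disk_request",
    "priority", "scheduling_class", "num_tasks",
]

# Synthetic stand-in for job records (assumption): label 1 = failed job.
X, y = make_classification(
    n_samples=1000, n_features=len(feature_names), n_informative=3,
    n_redundant=0, random_state=0,
)

# Keep the k features with the highest univariate F-score against the label.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original feature columns.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

for name, score in zip(feature_names, selector.scores_):
    print(f"{name}: F = {score:.2f}")
print("selected:", selected)
```

A reduced feature set like `X_selected` can then be passed to any classifier in place of the full matrix, which lowers training cost on large traces.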

  11. Conclusion and Future Work
     • Developing a prediction model for failed jobs based on ML methods.
     • Detecting failed jobs before the cloud management system schedules them.
     • Increasing the reliability and availability of cloud job execution.
     • Applying different classification algorithms to various workload traces.
     • In future work, we will develop the proposed model using a deep learning approach to improve the accuracy.
     • In addition, future research will consider mitigation policies and techniques.

  12. Contact
     Mohammad S. Jassas: mohammad.jassas@ontariotechu.net
     Qusay H. Mahmoud: qusay.mahmoud@ontariotechu.net
