Evaluation of a Failure Prediction Model for Large Scale Cloud Applications
Mohammad S. Jassas and Qusay H. Mahmoud
Presentation at Canadian AI 2020
Introduction
▪ Cloud services → complexity for cloud architectures.
▪ Cloud applications have a high probability of failure.
▪ Most cloud providers have experienced a failure in at least one of their services.
▪ AWS experienced a failure in its Elastic Block Store (EBS) service [7].
▪ Many organizations are planning to use public cloud environments.
▪ Cloud providers → maintaining their services to provide cloud consumers with a high level of quality of service (QoS).
[7] P. Marshall, K. Keahey, T. Freeman: Elastic site: Using clouds to elastically extend site resources. In: 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).
Problem Statement
• Providing 24x7 service uptime has become one of the most significant challenges facing cloud providers.
• Failed jobs consume a notable amount of computational resources and memory.
Objective
▪ Decrease the number of failed tasks
▪ High reliability + availability
▪ Reduce resource wastage
▪ Minimize time + cost
▪ Increase the performance of cloud apps
Related Work
▪ Failure analysis and characterization have been studied widely in grid computing, cloud clusters, and supercomputers [1].
▪ The Google traces [3] are used in different research studies, including workload characterization [5] and the application of statistical methods.
▪ In [2], we studied workload features such as memory usage, CPU speed, and disk space.
▪ Limited research has been done on failure prediction [4,5,6].
▪ El-Sayed et al. [4] designed a job failure prediction model using a random forest (RF) classifier.
[1] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
[2] Jassas, M., Mahmoud, Q.H.: Failure analysis and characterization of scheduling jobs in Google cluster trace. IEEE (2018)
[3] Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Google Inc. (2011)
[4] El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: a trace-driven approach to understanding, predicting, and mitigating job terminations. IEEE (2017)
[5] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
[6] Rosà, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM (2015)
Proposed Solution
Experiments and Evaluation Results
▪ Trace Description (Google and LANL)
▪ Experimental Setup
  ▪ scikit-learn → ML packages in Python
  ▪ Microsoft Azure → the Google trace has large volumes of data, requiring HPC nodes for analysis and prediction.
Trace sources: google/cluster-data: Borg cluster traces from Google (GitHub); The Atlas Cluster Trace Repository (USENIX)
Experiments and Evaluation Results
▪ Classifiers and Prediction Techniques
Fig. 5. Performance evaluation of different algorithms applied to the Google trace
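A comparison of this kind can be sketched with scikit-learn, the library named in the experimental setup. This is a minimal sketch under stated assumptions, not the talk's pipeline: the data is synthetic (generated by `make_classification`, not taken from the Google or LANL traces), the three classifiers are a hypothetical candidate set, and precision, recall, and F1 are assumed as the evaluation metrics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for job features (assumption: NOT real trace data).
# Label 1 = failed job, label 0 = finished job; failures are the minority class.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

# Hypothetical candidate set, chosen only for illustration.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit each classifier and score its predictions for the "failed" class.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

The split is stratified so that the minority "failed" class keeps the same proportion in the training and test sets. That matters when failures are much rarer than successful jobs.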
Experiments and Evaluation Results
Fig. 6. Performance evaluation of different algorithms applied to the Mustang and Trinity traces
Experiments and Evaluation Results
▪ Feature Selection Algorithms
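The slide does not list which feature selection algorithms were used, so the following is only an illustrative sketch of two common scikit-learn options: a univariate filter (`SelectKBest` with the ANOVA F-test) and a ranking by random forest feature importance. The data is synthetic and the feature names are hypothetical placeholders, not columns from the traces.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption): 6 features, of which 3 are informative.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)
# Placeholder names only; they do not correspond to real trace attributes.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Option 1: univariate filter that keeps the k features with the highest F-score.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("SelectKBest keeps:", selected)

# Option 2: rank features by impurity-based importance from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The filter approach is cheap and model-independent. The forest-based ranking reflects interactions the classifier actually exploits, but it costs a full model fit.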
Conclusion and Future Work
• Developing a prediction model for failed jobs based on ML methods.
• Detecting failed jobs before the cloud management system schedules them.
• Increasing the reliability and availability of cloud job execution.
• Applying different classification algorithms to various workload traces.
• In future work, we will develop the proposed model using a deep learning approach to improve the accuracy.
• In addition, future research will consider mitigation policies and techniques.
Mohammad S. Jassas: mohammad.jassas@ontariotechu.net
Qusay H. Mahmoud: qusay.mahmoud@ontariotechu.net