
Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms



  1. Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms. Liqun Shao, Yiwen Zhu, Siqi Liu*, Abhiram Eswaran, Kristin Lieber, Janhavi Mahajan, Minsoo Thigpen, Sudhir Darbha, Subru Krishnan, Soundar Srinivasan, Carlo Curino, Konstantinos Karanasos. Microsoft, *University of Pittsburgh

  2. Microsoft’s Internal Big Data Analytics Platform: 500K jobs/day, 250K nodes

  3. https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwjv9uXU0__lAhWtIDQIHaU0ABwQjB16BAgBEAM&url=https%3A%2F%2Fwww.intellectualtakeout.org%2Farticle%2Fkanye-wests-private-firefighting-force-good&psig=AOvVaw2pinteqP1A7uhZRdXBfq0J&ust=1574575139344414

  4. My job is SLOWER …

  5. My job is SLOWER …

  6. On-Call Support Engineer Workflow: 57 mins, 88 mins

  7. • End-to-end, deployed and used • Identify job slowdown causes • Drops the investigation time • Consistent results validated by domain experts

  8. Griffon: Before and After
     Before Griffon: • A job goes out of its service-level objectives (SLO) and the engineer is alerted. • An engineer spends hours of manual labor looking through hundreds of metrics. • After 2-3 days of investigation, the reason for the job slowdown is found.
     After Griffon: • A job goes out of SLO and the engineer is alerted. • The Job ID and VC are fed through Griffon and the top reasons for the job slowdown are generated automatically. • The reason is found in the top five generated by Griffon. • All the metrics Griffon has looked at can be ruled out and the engineer can direct their efforts to a smaller set of metrics.

  9. Griffon • ML Methodology • System Architecture

  10. Challenges • Data collection: data wrangling; identifying the right data; unlabeled data • Model building: small amount of validation data; tradeoff between accuracy and interpretability; cannot maintain models for each job template • Deployment and scalability • Evaluation: evaluation metrics for root causes of slow jobs

  11. Identify Job Slowdown Reasons • Job Runtime Predictor • Feature Contributions

  12. Job Runtime Prediction: Job Runtime Predictor, MARE by model type

      Model               LR     RF     GBT    DNN
      Per-Template Model  0.186  0.116  0.124  0.146
      Global Model        0.235  0.121  0.277  0.353
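The slide does not expand the acronym MARE. A minimal sketch of how such a score could be computed, assuming MARE means mean absolute relative error (the average of |predicted - actual| / actual over jobs). The function name and the runtimes below are hypothetical and are not taken from the presentation.

```python
def mare(actual, predicted):
    """Mean absolute relative error, assuming MARE = mean(|pred - actual| / actual)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the prediction error relative to the true runtime.
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical job runtimes in minutes (illustration only).
actual_runtimes = [10.0, 20.0, 40.0]
predicted_runtimes = [12.0, 18.0, 40.0]

score = mare(actual_runtimes, predicted_runtimes)
print(f"MARE = {score:.3f}")
```

Under this definition a lower value is better, so a score of 0.116 would mean predictions are off by about 11.6% of the true runtime on average.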

  13. Feature Contributions • Reformulate decision tree models to linear models: • Compare feature contributions to baseline predictions:
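The equations for these two steps are not reproduced in the transcript. One way to write them, assuming the usual path-based decomposition of a tree prediction into a bias term plus per-feature contributions, is sketched here. The symbols $c$, $\mathrm{contrib}_k$, $x_{\text{slow}}$, $x_{\text{base}}$ and $\Delta_k$ are this sketch's notation, not necessarily the authors'.

```latex
% Tree (or forest) prediction rewritten as a linear sum over K features:
\hat{y}(x) = c + \sum_{k=1}^{K} \mathrm{contrib}_k(x)

% Change in each feature's contribution between the slow job and a baseline job:
\Delta_k = \mathrm{contrib}_k(x_{\text{slow}}) - \mathrm{contrib}_k(x_{\text{base}})
```

When both jobs share the same intercept $c$, the $\Delta_k$ sum to the difference between the two predictions, $\hat{y}(x_{\text{slow}}) - \hat{y}(x_{\text{base}})$.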

  14. Feature Contributions: Intercept/Bias 10 m; InputSize +6 m; JobPriority -4 m; BonusPnHours -0 m; Prediction 12 m

  15. Slow Job: Intercept 10 m; InputSize +6 m; JobPriority -4 m; BonusPnHours -0 m; Prediction 12 m. Baseline Job: Intercept 10 m; InputSize +3 m; JobPriority -2 m; BonusPnHours -4 m; Prediction 7 m. Differences: InputSize: 6 - 3 = 3; JobPriority: 4 - 2 = 2; BonusPnHours: 4 - 0 = 4
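The slow-job versus baseline-job comparison on this slide can be replayed in a few lines. This is a sketch using only the numbers shown on the slide. The function names are hypothetical, and ranking by absolute difference is an assumption rather than something the slide states.

```python
# Per-feature contributions in minutes, as read from the slide's example.
INTERCEPT = 10
slow_job = {"InputSize": 6, "JobPriority": -4, "BonusPnHours": 0}
baseline_job = {"InputSize": 3, "JobPriority": -2, "BonusPnHours": -4}


def predict(contribs, intercept=INTERCEPT):
    """Prediction = intercept + sum of feature contributions."""
    return intercept + sum(contribs.values())


def contribution_deltas(slow, base):
    """Signed change in each feature's contribution (slow minus baseline)."""
    return {feature: slow[feature] - base[feature] for feature in slow}


def rank_reasons(deltas):
    """Assumption: order candidate reasons by absolute change, largest first."""
    return sorted(deltas, key=lambda feature: abs(deltas[feature]), reverse=True)


deltas = contribution_deltas(slow_job, baseline_job)
ranking = rank_reasons(deltas)

print(predict(slow_job), predict(baseline_job))  # 12 7
print(deltas)   # {'InputSize': 3, 'JobPriority': -2, 'BonusPnHours': 4}
print(ranking)  # ['BonusPnHours', 'InputSize', 'JobPriority']
```

The slide lists the JobPriority difference as a magnitude (4 - 2 = 2), while the signed difference computed here is -2. The signed differences sum to 5 minutes, which equals the gap between the 12-minute and 7-minute predictions.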

  16. Architecture

  17. Azure Big Data Analytics Platform

  18. Azure Big Data Analytics Platform. Azure ML with MLflow: • Archiving • Versioning • Serving

  19. Flask Application

  20. Flask Application

  21. Flask Application

  22. Griffon Output

  25. Validation of Griffon Predictions

      Job Id  Predicted Reason       Engineer Validated Reason  Rank  Confidence Level
      9182    Input size             Input size                 1     High
      8578    Revocation             Revocation                 4     Medium
      4414    Yarn or cluster issue  Yarn or cluster issue      -     Low
      6170    PN hours               PN hours                   5     Medium
      7588    Time skew              Time skew                  1     High
      3798    PN hours               PN hours                   1     High
      1590    PN hours               PN hours                   1     High
      2560    Usable machine count   Usable machine count       2     High
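One way to summarize the validation table is a top-k hit rate. The sketch below assumes that "Rank" is the position of the engineer-validated reason in Griffon's ranked list and that "-" means the reason was not ranked. The slide does not state either point, and the function name is hypothetical.

```python
# Rank of the engineer-validated reason per job, copied from the slide.
# None stands for the "-" entry (assumed to mean "not ranked").
validated_ranks = {
    9182: 1,
    8578: 4,
    4414: None,
    6170: 5,
    7588: 1,
    3798: 1,
    1590: 1,
    2560: 2,
}


def top_k_hit_rate(ranks, k):
    """Fraction of jobs whose validated reason appears within the top k."""
    hits = sum(1 for rank in ranks.values() if rank is not None and rank <= k)
    return hits / len(ranks)


print(top_k_hit_rate(validated_ranks, 1))  # 0.5
print(top_k_hit_rate(validated_ranks, 5))  # 0.875
```

Under these assumptions, the validated reason is ranked first for 4 of the 8 listed jobs and falls within the top five for 7 of them.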

  26. Scalability & Generalization

  27. Conclusions • End-to-end interpretable ranking system to identify the root causes of job slowdowns • No human-labeled reasons needed • Highly consistent results validated by on-call engineers • The model generalizes well, as shown by testing on job templates not included in the training set
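Testing on job templates that are absent from the training set implies a split by template rather than by individual job. A minimal sketch of such a split follows. The field names, the 25% test fraction, and the synthetic jobs are assumptions for illustration, not details from the presentation.

```python
import random


def split_by_template(jobs, test_fraction=0.25, seed=0):
    """Hold out whole templates so no test template is seen during training."""
    templates = sorted({job["template"] for job in jobs})
    rng = random.Random(seed)
    rng.shuffle(templates)
    # Keep at least one template in the held-out set.
    n_test = max(1, round(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])
    train = [job for job in jobs if job["template"] not in test_templates]
    test = [job for job in jobs if job["template"] in test_templates]
    return train, test


# Hypothetical jobs spread over four templates (illustration only).
jobs = [{"job_id": i, "template": f"T{i % 4}"} for i in range(12)]
train_jobs, test_jobs = split_by_template(jobs)

train_templates = {job["template"] for job in train_jobs}
held_out_templates = {job["template"] for job in test_jobs}
print(len(train_jobs), len(test_jobs), train_templates & held_out_templates)
```

Splitting at the template level keeps recurring runs of the same job out of both sets at once, which a random per-job split would not guarantee.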

  28. Thank you! Please see our poster for more details ☺!
