
FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation. Ping Liu, Yu Chen, Xiaohui Nie, Jing Zhu, Shenglin Zhang, Kaixin Sui, Ming Zhang, Dan Pei


  1. FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation. Ping Liu, Yu Chen, Xiaohui Nie, Jing Zhu, Shenglin Zhang, Kaixin Sui, Ming Zhang, Dan Pei


  4. Background: why do we focus on failure mitigation? Because mitigation takes too long for a complex distributed service.

  5. Service outages of 2019: three and a half hours before successful mitigation.

  6. Service outages of 2019: almost a full day before successful mitigation.

  7. Service outages of 2019: almost three hours before successful mitigation.

  8. Background: mitigation times are far too long (three and a half hours, a full day, three hours, ...). Our algorithm cuts the mitigation time by more than 80% on average.

  9. The name FluxRank combines Flux (the fluctuation of KPIs) and Rank.

  10. Background: failure mitigation takes too much time. Why?

  11. Troubleshooting process: critical KPIs (response time, error rate, ...) are continuously monitored.

  12. Troubleshooting process: the operator confirms the failure, starting from the failure start time observed in the response-time KPI.

  13. Troubleshooting process: after confirmation, the operator starts mitigation at the mitigation start time. Typical mitigation actions: • Switch traffic • Roll back the version • Restart instances • ...

  14. Troubleshooting process: after confirmation and mitigation, developers perform root cause analysis: • Analyze source code • Analyze logs • ...

  15. Troubleshooting process: between confirmation and root cause analysis, how do operators mitigate the failure?

  16. Mitigation. A software service (web server, database, computation, ...) spans hundreds of modules, tens of data centers, hundreds of machines per module, and hundreds of KPIs per machine.

  17. Mitigation failed. The software service (web server, database, computation) keeps failing across its data centers.

  18. Mitigation: anomaly detection by statistical methods, like a static threshold or the 3-sigma rule, raises alerts on machine KPIs.
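As a concrete illustration (a minimal sketch, not the detection logic from the talk), a 3-sigma detector flags a KPI point whose deviation from the mean of a recent window exceeds three standard deviations:

```python
import statistics

def three_sigma_alerts(values, window=60):
    """Flag indices where the value deviates more than 3 standard
    deviations from the mean of the preceding window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and abs(values[i] - mu) > 3 * sigma:
            alerts.append(i)
    return alerts

# A KPI oscillating around 10.5, then a spike to 100:
kpi = [10.0, 11.0] * 50 + [100.0]
three_sigma_alerts(kpi)  # flags the spike at index 100
```

A static threshold is even simpler: alert whenever the value crosses a fixed bound.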

  19. Mitigation failed. Because of the dependencies between modules and machines, failures propagate between modules and machines.

  20. Mitigation: alerts will be found everywhere! (Web server, database, computation, across all data centers.)

  21. Mitigation: try possible reasons one by one. Find possible failure reasons from historical experience: • Workload increase? • System update events? • New service online? • ... Take mitigation actions: • Switch traffic • Roll back the version • Restart instances • ...

  22. Mitigation: if trying possible reasons fails to mitigate the failure, operators manually scan KPIs to find the root cause location.

  23. Mitigation: why are operators reluctant to check the code and exception logs? Only service developers can understand the details of the code and exception logs.

  24. Mitigation: why are operators reluctant to check the code and exception logs? Operators mostly scan KPIs to monitor the running status of modules and machines.

  25. Mitigation: tens of thousands of machines, hundreds of KPIs per machine, hundreds of modules. That is millions of KPI time series. The search space is too huge!

  26. Mitigation: yet with tens of thousands of machines, hundreds of KPIs, and hundreds of modules, the operator still has to mitigate the failure quickly!

  27. Mitigation: the root cause location can be localized along the dependency graph, which represents the dependencies between modules (web server, database, computation in the slide's example). Dependency-graph-based approaches: • Sherlock [SIGCOMM'07] • MonitorRank [SIGMETRICS'13] • FChain [ICDCS'13] • CauseInfer [INFOCOM'14] • BRCA [IPCCC'16]
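To make the idea concrete (a simplified sketch under an assumed propagation model, not any of the cited systems): since anomalies propagate from a dependency to its dependents, the anomalous modules whose own dependencies are all healthy are natural root cause candidates.

```python
def candidate_root_causes(depends_on, anomalous):
    """Among anomalous modules, keep those with no anomalous
    dependency: anomalies propagate upward along the graph, so the
    deepest anomalous nodes are the root cause candidates."""
    return {
        m for m in anomalous
        if not any(d in anomalous for d in depends_on.get(m, ()))
    }

# Hypothetical topology: the web server depends on the database and
# the computation module, which itself depends on the database.
deps = {
    "web": ["database", "computation"],
    "computation": ["database"],
    "database": [],
}
candidate_root_causes(deps, {"web", "computation", "database"})
# only "database" remains: all its dependencies are healthy
```

The hard part, as the next slides argue, is obtaining `depends_on` at all for a real service.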

  28. Mitigation: in practice, automatically obtaining the dependency graph of an online complex distributed service is difficult: • Additional data-collection code needs to be added, as in Google's Dapper. • For an online complex distributed service, this is infeasible.

  29. Mitigation: the dependency graph can also be built manually from the experience of developers and operators, but maintaining it for rapidly changing software services is difficult: the quick change of the code makes the dependency graph elusive.

  30. Mitigation: therefore, in practice, the localization process is still manual.

  31. Core idea: if the manual scanning process can be automated by machine learning, the overall mitigation time can be greatly reduced. A machine learning algorithm takes KPIs as input and outputs root cause machines.

  32. Core idea: directly training machine learning models in an end-to-end manner (KPIs in, root cause machines out) does not work: • Lack of interpretability. • Insufficient failure cases for training.

  33. Core idea: domain knowledge can be utilized to divide the problem into several phases (Phase 1, Phase 2, ...). Each phase has sufficient data, and an interpretable algorithm can be used in each phase.
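A hedged sketch of what such a phased decomposition might look like. The phase contents, the `machine:kpi` naming scheme, and the mean-shift scoring are illustrative assumptions, not the algorithm from the talk: phase 1 quantifies how much each KPI changed around the failure, phase 2 aggregates changes per machine, and phase 3 ranks machines.

```python
def quantify_changes(kpis, failure_idx):
    """Phase 1 (illustrative): score each KPI by the shift of its
    mean before vs. after the failure start index."""
    scores = {}
    for name, series in kpis.items():
        before = series[:failure_idx]
        after = series[failure_idx:]
        base = sum(before) / len(before)
        scores[name] = abs(sum(after) / len(after) - base)
    return scores

def group_by_machine(scores):
    """Phase 2 (illustrative): sum KPI change scores per machine,
    assuming KPI names look like 'machine:kpi'."""
    per_machine = {}
    for name, score in scores.items():
        machine = name.split(":")[0]
        per_machine[machine] = per_machine.get(machine, 0.0) + score
    return per_machine

def rank_machines(per_machine):
    """Phase 3 (illustrative): rank machines by total change,
    most-changed first."""
    return sorted(per_machine, key=per_machine.get, reverse=True)

# m1's CPU KPI jumps at index 10; m2 stays flat, so m1 ranks first.
kpis = {"m1:cpu": [1.0] * 10 + [9.0] * 10, "m2:cpu": [1.0] * 20}
ranking = rank_machines(group_by_machine(quantify_changes(kpis, 10)))
```

Because each phase is small and inspectable, an operator can audit why a machine ranked high, which is exactly what the end-to-end model lacked.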

  34. Manual localization without a dependency graph. Step 1: scan the KPIs to understand the status of the machines (web server, database, computation, across the data centers).
