
FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation. Ping Liu, Yu Chen, Xiaohui Nie, Jing Zhu, Shenglin Zhang, Kaixin Sui, Ming Zhang, Dan Pei


  1. FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation. Ping Liu, Yu Chen, Xiaohui Nie, Jing Zhu, Shenglin Zhang, Kaixin Sui, Ming Zhang, Dan Pei


  4. Background: why do we focus on failure mitigation? Because mitigation takes too long for a complex distributed service.

  5. Service outages of 2019: three and a half hours before successful mitigation.

  6. Service outages of 2019: almost a full day before successful mitigation.

  7. Service outages of 2019: almost three hours before successful mitigation.

  8. Background: mitigation times are far too long (three and a half hours, a full day, three hours, ...). Our algorithm cuts the mitigation time by more than 80% on average.

  9. The name FluxRank combines Flux (the fluctuation of KPIs) and Rank.

  10. Background: failure mitigation takes too much time. Why?

  11. Troubleshooting process: critical KPIs (response time, error rate, ...) are continuously monitored.

  12. Troubleshooting process: the operator confirms the failure, starting from the failure start time observed in the response-time KPI.

  13. Troubleshooting process: after confirmation, the operator starts mitigation at the mitigation start time. Typical mitigation actions: • Switch traffic • Roll back the version • Restart instances • ...

  14. Troubleshooting process: after confirmation and mitigation, developers perform root cause analysis: • Analyze source code • Analyze logs • ...

  15. Troubleshooting process: between confirmation and root cause analysis, how do operators mitigate the failure?

  16. Mitigation. A software service (web server, database, computation, ...) spans hundreds of modules, tens of data centers, hundreds of machines per module, and hundreds of KPIs per machine.

  17. Mitigation failed. The software service (web server, database, computation) keeps failing across its data centers.

  18. Mitigation: anomaly detection by statistical methods, like a static threshold or the 3-sigma rule, raises alerts on machine KPIs.
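As a concrete illustration (a minimal sketch, not the detection logic from the talk), a 3-sigma detector flags a KPI point whose deviation from the mean of a recent window exceeds three standard deviations:

```python
import statistics

def three_sigma_alerts(values, window=60):
    """Flag indices where the value deviates more than 3 standard
    deviations from the mean of the preceding window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = statistics.mean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and abs(values[i] - mu) > 3 * sigma:
            alerts.append(i)
    return alerts

# A KPI oscillating around 10.5, then a spike to 100:
kpi = [10.0, 11.0] * 50 + [100.0]
three_sigma_alerts(kpi)  # flags the spike at index 100
```

A static threshold is even simpler: alert whenever the value crosses a fixed bound.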

  19. Mitigation failed. Because of the dependencies between modules and machines, failures propagate between modules and machines.

  20. Mitigation: alerts will be found everywhere! (Web server, database, computation, across all data centers.)

  21. Mitigation: try possible reasons one by one. Find possible failure reasons from historical experience: • Workload increase? • System update events? • New service online? • ... Take mitigation actions: • Switch traffic • Roll back the version • Restart instances • ...

  22. Mitigation: if trying possible reasons fails to mitigate the failure, operators manually scan KPIs to find the root cause location.

  23. Mitigation: why are operators reluctant to check the code and exception logs? Only service developers can understand the details of the code and exception logs.

  24. Mitigation: why are operators reluctant to check the code and exception logs? Operators mostly scan KPIs to monitor the running status of modules and machines.

  25. Mitigation: tens of thousands of machines, hundreds of KPIs per machine, hundreds of modules. That is millions of KPI time series. The search space is too huge!

  26. Mitigation: yet with tens of thousands of machines, hundreds of KPIs, and hundreds of modules, the operator still has to mitigate the failure quickly!

  27. Mitigation: the root cause location can be localized along the dependency graph, which represents the dependencies between modules (web server, database, computation in the slide's example). Dependency-graph-based approaches: • Sherlock [SIGCOMM'07] • MonitorRank [SIGMETRICS'13] • FChain [ICDCS'13] • CauseInfer [INFOCOM'14] • BRCA [IPCCC'16]
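To make the idea concrete (a simplified sketch under an assumed propagation model, not any of the cited systems): since anomalies propagate from a dependency to its dependents, the anomalous modules whose own dependencies are all healthy are natural root cause candidates.

```python
def candidate_root_causes(depends_on, anomalous):
    """Among anomalous modules, keep those with no anomalous
    dependency: anomalies propagate upward along the graph, so the
    deepest anomalous nodes are the root cause candidates."""
    return {
        m for m in anomalous
        if not any(d in anomalous for d in depends_on.get(m, ()))
    }

# Hypothetical topology: the web server depends on the database and
# the computation module, which itself depends on the database.
deps = {
    "web": ["database", "computation"],
    "computation": ["database"],
    "database": [],
}
candidate_root_causes(deps, {"web", "computation", "database"})
# only "database" remains: all its dependencies are healthy
```

The hard part, as the next slides argue, is obtaining `depends_on` at all for a real service.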

  28. Mitigation: in practice, automatically obtaining the dependency graph of an online complex distributed service is difficult: • Additional data-collection code needs to be added, as in Google's Dapper. • For an online complex distributed service, this is infeasible.

  29. Mitigation: the dependency graph can also be built manually from the experience of developers and operators, but maintaining it for rapidly changing software services is difficult: the quick change of the code makes the dependency graph elusive.

  30. Mitigation: therefore, in practice, the localization process is still manual.

  31. Core idea: if the manual scanning process can be automated by machine learning, the overall mitigation time can be greatly reduced. A machine learning algorithm takes KPIs as input and outputs root cause machines.

  32. Core idea: directly training machine learning models in an end-to-end manner (KPIs in, root cause machines out) does not work: • Lack of interpretability. • Insufficient failure cases for training.

  33. Core idea: domain knowledge can be utilized to divide the problem into several phases (Phase 1, Phase 2, ...). Each phase has sufficient data, and an interpretable algorithm can be used in each phase.
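A hedged sketch of what such a phased decomposition might look like. The phase contents, the `machine:kpi` naming scheme, and the mean-shift scoring are illustrative assumptions, not the algorithm from the talk: phase 1 quantifies how much each KPI changed around the failure, phase 2 aggregates changes per machine, and phase 3 ranks machines.

```python
def quantify_changes(kpis, failure_idx):
    """Phase 1 (illustrative): score each KPI by the shift of its
    mean before vs. after the failure start index."""
    scores = {}
    for name, series in kpis.items():
        before = series[:failure_idx]
        after = series[failure_idx:]
        base = sum(before) / len(before)
        scores[name] = abs(sum(after) / len(after) - base)
    return scores

def group_by_machine(scores):
    """Phase 2 (illustrative): sum KPI change scores per machine,
    assuming KPI names look like 'machine:kpi'."""
    per_machine = {}
    for name, score in scores.items():
        machine = name.split(":")[0]
        per_machine[machine] = per_machine.get(machine, 0.0) + score
    return per_machine

def rank_machines(per_machine):
    """Phase 3 (illustrative): rank machines by total change,
    most-changed first."""
    return sorted(per_machine, key=per_machine.get, reverse=True)

# m1's CPU KPI jumps at index 10; m2 stays flat, so m1 ranks first.
kpis = {"m1:cpu": [1.0] * 10 + [9.0] * 10, "m2:cpu": [1.0] * 20}
ranking = rank_machines(group_by_machine(quantify_changes(kpis, 10)))
```

Because each phase is small and inspectable, an operator can audit why a machine ranked high, which is exactly what the end-to-end model lacked.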

  34. Manual localization without a dependency graph. Step 1: scan the KPIs to understand the status of the machines (web server, database, computation, across the data centers).
