Generic and Robust Localization of Multi-Dimensional Root Causes
Zeyan Li, Chengyang Luo, Yiwei Zhao, Yongqian Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang, Dan Pei
ISSRE 2019
Outline
● Background
● Methodology
● Experiment
● Summary
Background
KPI: key performance indicator
● An anomaly happens, and we need to find the root cause.
[Figure: #Orders over Time]
Motivation
Raw log for an order:
Timestamp | Province | ISP | Device | .......
2019.10.15 13:04 | Beijing | China Mobile | PC | .......
[Figure: Total #Orders sliced by Province (Beijing, Shanghai, Guangdong), ISP (China Mobile, China Unicom), and Device (PC, Cellphone), and by pairs of dimensions, e.g. Beijing & China Mobile, Shanghai & China Mobile, Beijing & China Unicom]
Multi-dimensional Data
• Cuboid: a way to slice the multi-dimensional data
• Attribute combination: an element of a cuboid
Examples (dimensions: Province, ISP, Device):
• Cuboid Province: Beijing, Shanghai, Guangdong
• Cuboid ISP: China Mobile, China Unicom, China Telecom
• Cuboid Province & ISP: Shanghai & China Mobile, Beijing & China Mobile, Beijing & China Unicom
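The two definitions above can be made concrete with a small sketch. It assumes a toy dataset with only two dimensions and the attribute values named on the slides; the helper names `cuboids` and `attribute_combinations` are hypothetical and not part of Squeeze.

```python
from itertools import combinations, product

# Hypothetical attribute values, taken from the slide examples.
attributes = {
    "Province": ["Beijing", "Shanghai", "Guangdong"],
    "ISP": ["China Mobile", "China Unicom"],
}


def cuboids(attrs):
    """Yield every non-empty subset of attribute names (one cuboid each)."""
    names = sorted(attrs)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def attribute_combinations(attrs, cuboid):
    """List every attribute combination (element) inside one cuboid."""
    return [dict(zip(cuboid, values))
            for values in product(*(attrs[name] for name in cuboid))]


all_cuboids = list(cuboids(attributes))
province_isp = attribute_combinations(attributes, ("ISP", "Province"))
```

Here `all_cuboids` holds the cuboids ISP, Province, and ISP & Province, and `province_isp` lists the six elements of the last one.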
Problem Statement
The KPI of the whole cube is abnormal, but where is the root cause?
A root cause is a set of attribute combinations.
[Figure: potential root causes in the Province, ISP, Device cube]
Challenge: Huge Search Space
Root cause: a set of attribute combinations
How many potential root causes are there for a simple 2-d dataset? $2^{2+7+14} - 1$
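A quick way to check the size of this search space is sketched below. It assumes the terms 2, 7, and 14 on the slide stand for two attributes with 2 and 7 values plus their 14 pairwise combinations; that reading and the function names are assumptions for illustration.

```python
from math import prod


def count_attribute_combinations(cardinalities):
    """Attribute combinations over all cuboids: each attribute is either
    fixed to one of its values or left out, minus the all-left-out case."""
    return prod(c + 1 for c in cardinalities) - 1


def count_potential_root_causes(cardinalities):
    """A root cause is any non-empty set of attribute combinations."""
    return 2 ** count_attribute_combinations(cardinalities) - 1


n_combinations = count_attribute_combinations((2, 7))   # 2 + 7 + 14
n_root_causes = count_potential_root_causes((2, 7))
```

Even this tiny example gives more than eight million candidate sets, which is why exhaustive search is impractical.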
Previous Approaches
Algorithm | Root Cause Assumption
Adtributor (NSDI, 2014) | single attribute
Recursive Adtributor (Master Thesis, 2018) | none
iDice (ICSE, 2016) | one or two attribute combinations
Apriori (TON, 2017) | none
HotSpot (IEEE Access, 2018) | all attribute combinations of the root cause in one cuboid
Squeeze (ISSRE, 2019) | those which cause the same changes are in one cuboid
[Figure: examples for Adtributor and iDice]
Previous Approaches
Algorithm | Measure
Adtributor (NSDI, 2014) | fundamental & derived (quotient)
Recursive Adtributor (Master Thesis, 2018) | fundamental & derived (quotient)
iDice (ICSE, 2016) | fundamental only
Apriori (TON, 2017) | fundamental & derived
HotSpot (IEEE Access, 2018) | fundamental only
Squeeze (ISSRE, 2019) | fundamental & derived (quotient, product)
● # Orders: fundamental, additive
● % Success Rate: derived, not additive
● iDice and HotSpot rely on addition, thus cannot handle derived measures
[Figure: Total Volume split into China Mobile and China Unicom]
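The difference between additive and non-additive measures can be checked with a few lines of arithmetic. The order counts below are made up for illustration and do not come from the talk.

```python
from fractions import Fraction

# Hypothetical counts per ISP: (successful orders, total orders).
orders = {"China Mobile": (90, 100), "China Unicom": (30, 100)}

total_success = sum(success for success, _ in orders.values())
total_orders = sum(total for _, total in orders.values())

# Fundamental measure (# Orders): the total is the sum of the parts.
# Derived measure (% Success Rate): the overall rate is a quotient of
# sums, which differs from the sum of the per-ISP rates.
overall_rate = Fraction(total_success, total_orders)
sum_of_rates = sum(Fraction(success, total) for success, total in orders.values())
```

An algorithm that aggregates a measure by plain addition would therefore report a wrong value for the success rate of the whole cube.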
Previous Approaches
Algorithm | Change Magnitude
Adtributor (NSDI, 2014) | significant
Recursive Adtributor (Master Thesis, 2018) | significant
iDice (ICSE, 2016) | significant
Apriori (TON, 2017) | any
HotSpot (IEEE Access, 2018) | significant
Squeeze (ISSRE, 2019) | any
[Figure: significant vs. insignificant changes for Beijing, Shanghai, Guangdong]
Previous Approaches
Algorithm | Parameter Fine Tuning
Adtributor (NSDI, 2014) | no
Recursive Adtributor (Master Thesis, 2018) | yes
iDice (ICSE, 2016) | no
Apriori (TON, 2017) | yes
HotSpot (IEEE Access, 2018) | no
Squeeze (ISSRE, 2019) | no
● Some approaches perform badly without parameter fine tuning
Previous Approaches
Algorithm | Time Cost
Adtributor (NSDI, 2014) | very short
Recursive Adtributor (Master Thesis, 2018) | short
iDice (ICSE, 2016) | very short
Apriori (TON, 2017) | always too long
HotSpot (IEEE Access, 2018) | sometimes long
Squeeze (ISSRE, 2019) | short
● Some approaches cost too much time
Previous Approaches
Algorithm | Root Cause Assumption | Measure | Change Magnitude | Parameter Fine Tuning | Time Cost
Adtributor (NSDI, 2014) | single attribute | fundamental & derived (quotient) | significant | no | very short
Recursive Adtributor (Master Thesis, 2018) | none | fundamental & derived (quotient) | significant | yes | short
iDice (ICSE, 2016) | one or two attribute combinations | fundamental only | significant | no | very short
Apriori (TON, 2017) | none | fundamental & derived | any | yes | always too long
HotSpot (IEEE Access, 2018) | all attribute combinations of the root cause in one cuboid | fundamental only | significant | no | sometimes long
Squeeze (ISSRE, 2019) | those which cause the same changes are in one cuboid | fundamental & derived (quotient, product) | any | no | short
Design Goals
Squeeze
● Root cause assumption: has no impractical assumptions
● Measure: handles both fundamental and derived measures
● Change magnitude: handles anomalies with any change magnitude
● Parameter fine tuning: does not need parameter fine tuning
● Time cost: is consistently fast in all cases
Core Idea: Generalized Ripple Effect (GRE)
Building on an idea from HotSpot [IEEE Access 2018], we propose the generalized ripple effect (GRE).
[Figure: the root cause is Beijing (among Beijing, Shanghai, Guangdong); it causes ripples in Beijing & China Mobile and Beijing & China Unicom. Values shown in the figure: 10, 20; 5, 10]
Core Idea: GRE & Deviation Score
Forecast value: $f$; real value: $v$
$$\text{deviation score} = 2\,\frac{f - v}{f + v}$$
● Beijing: $f = 30$, $v = 15$, $ds = \frac{2}{3}$
● Beijing & China Mobile: $f = 20$, $v = 10$, $ds = \frac{2}{3}$
● Beijing & China Unicom: $f = 10$, $v = 5$, $ds = \frac{2}{3}$
The deviation scores should fall in the same bin.
[Figure: PDF of deviation scores for Beijing, Shanghai, Guangdong]
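The slide's numbers can be reproduced with a short sketch. It reads the formula as deviation score = 2 (f - v) / (f + v), with f the forecast value and v the real value; the function name is hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction


def deviation_score(forecast, real):
    """Deviation score 2 * (f - v) / (f + v); assumes f + v is not zero."""
    return 2 * Fraction(forecast - real, forecast + real)


# (forecast, real) pairs as shown on the slide.
slide_values = {
    "Beijing": (30, 15),
    "Beijing & China Mobile": (20, 10),
    "Beijing & China Unicom": (10, 5),
}
scores = {name: deviation_score(f, v) for name, (f, v) in slide_values.items()}
```

All three entries get the same score, which is what GRE predicts for attribute combinations affected by the same root cause.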
Core Idea: GRE in Real World Cases
# successful orders drops after an update.
By manual analysis, the root cause is ServiceType=020020.
The deviation scores of the affected attribute combinations are in the same bin, which supports GRE.
Core Idea: GRE in Real World Cases
Case 2: # successful orders drops; there are 4 root cause attribute combinations.
The data shows that deviation scores of the same root cause are in the same bin.
Generalized Ripple Effect
Does GRE hold for both fundamental and derived measures?
Yes. Please see the details in the paper.
Core Idea: Generalized Potential Score
Evaluates how likely a set of attribute combinations is to be the root cause
Core Idea: Generalized Potential Score
● KPI values should be as expected by GRE
→ $\frac{v(\text{Beijing})}{f(\text{Beijing})} = 0.5$, half fails
→ $a(\text{Beijing, China Mobile}) = f(\text{Beijing, China Mobile}) \times 0.5 = 5$
→ $a(\text{Beijing, China Unicom}) = f(\text{Beijing, China Unicom}) \times 0.5 = 10$
● Forecast value and real value should be close
→ $f(S_2) - v(S_2) \approx 0$
● The score includes a normalization term.
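The two conditions on this slide can be combined into a single score, sketched below. The exact formula is not recoverable from the slide text, so the normalized form used here (one minus the two error terms divided by the total deviation) is an assumption, as are the function names and the Shanghai values; see the paper for the authors' definition.

```python
def mean_abs_diff(xs, ys):
    """Average absolute difference; 0.0 for empty inputs."""
    if not xs:
        return 0.0
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def potential_score(f_in, v_in, f_out, v_out):
    """Score a candidate root cause (assumed form, not the paper's text).

    f_in/v_in: forecast and real values of leaves covered by the candidate.
    f_out/v_out: forecast and real values of all other leaves.
    Assumes sum(f_in) > 0.
    """
    ratio = sum(v_in) / sum(f_in)            # e.g. 0.5 when half fails
    a_in = [f * ratio for f in f_in]         # expected values under GRE
    err_in = mean_abs_diff(v_in, a_in)       # covered leaves vs. GRE
    err_out = mean_abs_diff(v_out, f_out)    # other leaves vs. forecast
    norm = mean_abs_diff(v_in, f_in) + err_out
    if norm == 0:
        return 0.0                           # no deviation to explain
    return 1 - (err_in + err_out) / norm


# Leaves: Beijing & China Mobile (f=10, v=5), Beijing & China Unicom
# (f=20, v=10), and two hypothetical normal Shanghai leaves (f=10, v=10).
score_beijing = potential_score([10, 20], [5, 10], [10, 10], [10, 10])
score_mobile = potential_score([10, 10], [5, 10], [20, 10], [10, 10])
```

With these numbers the candidate Beijing explains the data perfectly, while the candidate China Mobile does not.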
Overall Architecture
[Figure: architecture of Squeeze]
Squeeze
● Bottom to top: clustering of leaf attribute combinations
● Top to bottom: search in each cluster
→ Root causes
Clustering
● Find attribute combinations affected by the same root cause
● Find attribute combinations that have similar deviation scores
● Local maxima: centroids
● Local minima: boundaries
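One simple way to turn "local maxima are centroids, local minima are boundaries" into code is a histogram walk, sketched below. This is a simplified stand-in and not the authors' clustering procedure; the bin count, the score range of -2 to 2, and the sample scores are assumptions for illustration.

```python
def cluster_by_histogram(scores, n_bins=20, lo=-2.0, hi=2.0):
    """Group deviation scores into clusters separated by histogram valleys."""
    width = (hi - lo) / n_bins

    def bin_of(score):
        return min(max(int((score - lo) / width), 0), n_bins - 1)

    counts = [0] * n_bins
    for score in scores:
        counts[bin_of(score)] += 1

    # Walk the bins left to right. A new cluster starts at every valley,
    # that is, when the count rises again after having fallen.
    labels = [0] * n_bins
    label, falling = 0, False
    for i in range(1, n_bins):
        if counts[i] < counts[i - 1]:
            falling = True
        elif counts[i] > counts[i - 1] and falling:
            label += 1
            falling = False
        labels[i] = label

    clusters = {}
    for score in scores:
        clusters.setdefault(labels[bin_of(score)], []).append(score)
    return list(clusters.values())


# Hypothetical deviation scores: four normal leaves and four abnormal ones.
sample_scores = [0.05, 0.06, 0.04, 0.07, 0.66, 0.67, 0.68, 0.65]
sample_clusters = cluster_by_histogram(sample_scores)
```

The sample yields two clusters, one around 0.05 (unaffected leaves) and one around 0.66 (leaves hit by the same root cause).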
Localize in Each Cluster
[Figure: a cluster on a Province (Beijing, Shanghai) by ISP (CM, CU) grid, with cell labels 2/2 and 0/2; cuboids Province, ISP, Province & ISP]
● Sorted list: Beijing, Shanghai, ......
● Top-K items in this list with highest GPS
● Beijing, GPS = 1, root cause
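The sorted list on this slide can be sketched as follows. The sketch assumes the list is ordered by the share of each attribute value's leaves that fall inside the cluster, and that every prefix of the list is then a candidate to score with GPS; both the ordering rule and the function names are assumptions, not confirmed by the slide text.

```python
from collections import Counter


def rank_by_cluster_share(leaves, cluster, index):
    """Sort the values of one attribute by the share of their leaves
    that belong to the cluster, highest share first."""
    total = Counter(leaf[index] for leaf in leaves)
    inside = Counter(leaf[index] for leaf in leaves if leaf in cluster)
    return sorted(total, key=lambda value: inside[value] / total[value],
                  reverse=True)


def candidate_root_causes(ranked):
    """Every prefix of the sorted list is one candidate set."""
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


# Leaves are (Province, ISP) pairs; the cluster holds both Beijing leaves.
leaves = [("Beijing", "CM"), ("Beijing", "CU"),
          ("Shanghai", "CM"), ("Shanghai", "CU")]
cluster = {("Beijing", "CM"), ("Beijing", "CU")}

ranked_provinces = rank_by_cluster_share(leaves, cluster, index=0)
candidates = candidate_root_causes(ranked_provinces)
```

Scoring each candidate and keeping the best one would then select Beijing in this toy case.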
Experiment Setup
We use
● real KPI datasets from 2 companies;
● synthetic anomalies => 7 semi-synthetic datasets;
● moving average as the forecasting algorithm.
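A moving-average forecast is easy to sketch. The window size and the sample history below are assumptions for illustration; the slide does not state the window used in the experiments.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0 or len(history) < window:
        raise ValueError("need at least `window` past values")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical #Orders of one attribute combination at past time points.
past_orders = [10, 12, 14, 16]
forecast = moving_average_forecast(past_orders, window=3)
```

The forecast value f produced this way is compared with the real value v for every leaf attribute combination.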
Effectiveness
Squeeze achieves relatively good F1-scores on both fundamental and derived measures.
[Figure panels: two of the fundamental measure datasets; derived measure dataset]
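For reference, an F1-score over sets of attribute combinations can be computed as sketched below. This is a per-case illustration with made-up inputs; how the paper aggregates the score across anomaly cases is not stated on the slide.

```python
def f1_score(predicted, actual):
    """F1-score between a predicted and a true set of root causes."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


exact = f1_score({"Beijing"}, {"Beijing"})
partial = f1_score({"Beijing", "Shanghai"}, {"Beijing"})
```

Reporting one extra attribute combination keeps recall at 1 but halves precision, so the partial match scores lower than the exact one.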
Efficiency
Squeeze is consistently fast: it takes only ten to twenty seconds in all cases.
Various Anomaly Change Magnitude
Squeeze performs well regardless of the anomaly change magnitude.
0.4% and 12% are the 25th and 75th percentiles of the change magnitudes.
Various Forecasting Residual
Squeeze performs well under various forecasting residuals and always outperforms the other approaches.
[Figure: two representative settings by Moving Average]
Summary
● Bottom-up & Top-down => Squeeze
● Contributions:
○ Generalized ripple effect
○ Squeeze algorithm
○ An experimental study on real-world and semi-synthetic data shows that Squeeze is both effective and efficient.
● Future work:
○ focus on numerical attributes
○ show GRE for more types of derived measures