Error Log Processing for Accurate Failure Prediction
Felix Salfner · Steffen Tschirpke
ICSI Berkeley · Humboldt-Universität zu Berlin
Introduction
■ Context of work: error-based online failure prediction
  [Figure: error events (C, A, C, B) observed in a data window up to the present time are used to predict whether a failure will occur]
■ Data used:
  ● Commercial telecommunication system
  ● 200 components, 2,000 classes
  ● Error and failure logs
→ In this talk we present the data preprocessing concepts we applied to obtain accurate failure prediction results
Contents
■ Key facts on the data
■ Overview of online failure prediction and the data preprocessing process
■ Detailed description of major preprocessing concepts
  ● Assigning IDs to error messages
  ● Failure sequence clustering
  ● Noise filtering
■ Experiments and results
Key Facts on the Data
■ Experimental setup: a call tracker records response times of the telecommunication system; error logs and a failure log are collected
■ 200 days of data from a 273-day period
■ 26,991,314 error log records
■ 1,560 failures of two types
■ Failure definition (performance failures): a failure occurs if, within a 5-minute interval, more than 0.01% of calls experience a response time above 250 ms (see the sketch below)
  [Figure: call response times over consecutive 5-minute windows; a window with ≤ 0.01% slow calls is fine, a window with > 0.01% slow calls is a failure]
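The failure definition lends itself to a small worked example. The sketch below is a minimal illustration under our own assumptions: call records are (timestamp, response time in ms) pairs, windows are fixed consecutive 5-minute intervals, and all function and constant names are ours; the slide specifies only the 5-minute window, the 250 ms limit, and the 0.01% threshold.

```python
from datetime import timedelta

WINDOW = timedelta(minutes=5)    # evaluation interval from the slide
SLOW_MS = 250                    # response-time limit in milliseconds
MAX_SLOW_FRACTION = 0.0001       # 0.01% of calls

def is_failure_window(calls):
    """calls: list of (datetime, response_time_ms) falling into one window."""
    if not calls:
        return False
    slow = sum(1 for _, rt in calls if rt > SLOW_MS)
    return slow / len(calls) > MAX_SLOW_FRACTION

def detect_failures(calls):
    """Split calls into consecutive 5-minute windows and return the start
    times of all windows that qualify as performance failures."""
    calls = sorted(calls, key=lambda c: c[0])
    if not calls:
        return []
    failures = []
    window_start = calls[0][0]
    bucket = []
    for ts, rt in calls:
        while ts >= window_start + WINDOW:   # close finished windows
            if is_failure_window(bucket):
                failures.append(window_start)
            bucket = []
            window_start += WINDOW
        bucket.append((ts, rt))
    if is_failure_window(bucket):            # last (partial) window
        failures.append(window_start)
    return failures
```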
Online Failure Prediction
■ Approach: pattern recognition using hidden semi-Markov models (HSMMs)
  [Figure: error sequences observed before failures (e.g., B C A F) are used to train an HSMM for failure sequences; sequences not followed by a failure train an HSMM for non-failure sequences; each sequence covers a data window of length t_d with lead time t_l before the failure]
■ Objectives for data preprocessing:
  ● Create a data set to train HSMM models that exposes key properties of the system
  ● Identify how to process incoming data at runtime
■ Tasks:
  ● Machine-processable data → error-ID assignment
  ● Separate sequences for inherent failure mechanisms → clustering
  ● Distinctive, noise-free sequences → noise filtering
Training Data Preprocessing
[Figure: training data preprocessing pipeline — error log → error-ID assignment → tupling; failure log → timestamp extraction; sequence extraction yields non-failure sequences (→ Model 0) and failure sequences, which are clustered and noise-filtered per failure mechanism 1…u (→ Models 1…u)]
Error ID Assignment
■ Problem: error logs contain no message IDs
  ● Example message of a log record: "process 1534: end of buffer reached"
  → Task: assign an ID to each message that characterizes what has happened
■ Approach: two steps (see the sketch below)
  ● Remove numbers: "process xx: end of buffer reached"
  ● Assign IDs based on Levenshtein's edit distance with a constant threshold
■ Reduction in the number of distinct messages:

  Data              No. of messages
  Original          1,695,160
  Without numbers   12,533
  Levenshtein       1,435
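As a concrete illustration of the two steps, here is a minimal sketch. The greedy assignment strategy and the threshold value are our assumptions; the slide only states that numbers are removed first and that IDs are then assigned via Levenshtein's edit distance with a constant threshold. All names are ours.

```python
import re

def strip_numbers(message):
    """Replace digit runs so messages differing only in numbers collapse."""
    return re.sub(r"\d+", "xx", message)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def assign_error_ids(messages, threshold=5):
    """Greedy assignment: a message gets the ID of the first representative
    within `threshold` edit operations, otherwise it opens a new ID."""
    representatives = []          # one exemplar message per ID
    ids = []
    for msg in messages:
        msg = strip_numbers(msg)
        for eid, rep in enumerate(representatives):
            if levenshtein(msg, rep) <= threshold:
                ids.append(eid)
                break
        else:
            representatives.append(msg)
            ids.append(len(representatives) - 1)
    return ids
```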
Failure Sequence Clustering
[Figure: the training data preprocessing pipeline again, locating the clustering step within it]
Failure Sequence Clustering (2)
■ Goal:
  ● Divide the set of training failure sequences into subsets
  ● Group according to sequence similarity
■ Approach (see the sketch below):
  ● Train a small HSMM for each failure sequence
  ● Apply each HSMM to all sequences
  ● The resulting sequence log-likelihoods express similarities
  ● Make the matrix symmetric
  ● Apply a standard clustering algorithm

  Example log-likelihood matrix (models M1–M3 applied to sequences F1–F3):

          F1      F2      F3
  M1     -2.1    -4.2    -9.7
  M2     -2.6    -1.3   -10.2
  M3     -7.8    -6.9    -1.2
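A compact sketch of this clustering idea follows. To stay self-contained it scores sequences with a smoothed per-sequence Markov chain instead of a real HSMM, symmetrizes the log-likelihood matrix by averaging, and uses complete-linkage hierarchical clustering; these three choices, and all names, are our assumptions rather than the authors' exact procedure.

```python
import math
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def train_markov(seq, alphabet, eps=1e-3):
    """Stand-in for a small per-sequence HSMM: smoothed first-order Markov chain."""
    idx = {s: i for i, s in enumerate(alphabet)}
    counts = np.full((len(alphabet), len(alphabet)), eps)
    for a, b in zip(seq, seq[1:]):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True), idx

def log_likelihood(model, seq):
    probs, idx = model
    return sum(math.log(probs[idx[a], idx[b]]) for a, b in zip(seq, seq[1:]))

def cluster_failure_sequences(sequences, n_clusters=2):
    alphabet = sorted({s for seq in sequences for s in seq})
    models = [train_markov(seq, alphabet) for seq in sequences]
    # Log-likelihood of every sequence under every per-sequence model.
    L = np.array([[log_likelihood(m, seq) for seq in sequences] for m in models])
    # Symmetrize, then turn similarities into non-negative distances.
    S = (L + L.T) / 2.0
    D = S.max() - S
    condensed = D[np.triu_indices(len(sequences), k=1)]
    Z = linkage(condensed, method="complete")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```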
Failure Sequence Clustering (3)
[Figure]
Noise Filtering
[Figure: the training data preprocessing pipeline again, locating the noise filtering step within it]
Noise Filtering (2)
■ Problem: clustered failure sequences contain many unrelated errors
  → Main reason: parallelism in the system
■ Assumption: indicative events occur more frequently prior to a failure than within other sequences
  → Apply a statistical test to quantify what "more frequently" means
  [Figure: failure sequences F1–F5 are first clustered into groups; within each group, errors that the test marks as unrelated are filtered out (✘), yielding training sequences for failure mechanisms 1…n]
Noise Filtering (3)
■ Testing variable derived from a goodness-of-fit test, based on:
  ● the number of occurrences of an error within the time window,
  ● the total number of errors in the time window, and
  ● the prior probability of occurrence of that error
■ Keep an event in the sequence if the test indicates that it occurs significantly more frequently than its prior suggests (a sketch of such a test follows below)
■ Three ways to estimate the priors from the training data set
  [Figure: nested sets — failure training sequences (groups G1–G4) within the training sequences, within the entire dataset]
■ Results
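The exact test variable and threshold did not survive extraction, so the sketch below should be read as an assumption: it uses the standard per-category contribution of a chi-square goodness-of-fit statistic, (n_i − m·p_i)² / (m·p_i), built from the three quantities listed above, a one-sided check so that only over-represented errors are kept, and a 1%-level critical value as threshold. All names are ours.

```python
from collections import Counter

def chi2_contribution(n_i, m, p_i):
    """Per-error goodness-of-fit term: observed count n_i of an error versus
    its expectation m * p_i under the prior distribution."""
    expected = m * p_i
    return (n_i - expected) ** 2 / expected

def filter_sequence(sequence, priors, threshold=6.63):
    """Keep only errors that occur significantly more often than their prior
    suggests; 6.63 is roughly the chi-square (1 dof) critical value at the
    1% level and stands in for whatever threshold the authors used."""
    counts = Counter(sequence)
    m = len(sequence)
    keep = set()
    for error_id, n_i in counts.items():
        p_i = priors.get(error_id, 1e-6)        # small floor for unseen errors
        over_represented = n_i > m * p_i        # one-sided: only "more frequent"
        if over_represented and chi2_contribution(n_i, m, p_i) > threshold:
            keep.add(error_id)
    return [e for e in sequence if e in keep]
```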
Experiments and Results
■ Objective: predict upcoming failures as accurately as possible
■ Metric used: F-measure
  ● Precision: number of correct alarms relative to the total number of alarms
  ● Recall: number of correct alarms relative to the total number of failures
  ● F-measure: harmonic mean of precision and recall
■ Failure prediction is achieved by comparing the sequence likelihoods of an incoming sequence computed from the failure and non-failure models
■ Classification involves a customizable decision threshold → we report the maximum F-measure (see the sketch below)
  [Figure: an incoming error sequence (B C A over a window t_d) is scored by the HSMM for failure sequences and the HSMM for non-failure sequences; the two sequence likelihoods are compared in the classification step to produce the failure prediction]

  Data               Max. F-measure   Relative quality
  Optimal results    0.66             100%
  Without grouping   0.5097           77%
  Without filtering  0.3601           55%
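A minimal sketch of the metric and of the threshold-based classification, under our own assumptions: an alarm is raised when the failure model's log-likelihood exceeds the non-failure model's by more than a threshold, and the threshold is swept over the observed score differences to find the maximum F-measure. Function names and the sweep strategy are ours.

```python
def f_measure(correct_alarms, total_alarms, total_failures):
    """Precision, recall, and their harmonic mean as defined on the slide."""
    precision = correct_alarms / total_alarms if total_alarms else 0.0
    recall = correct_alarms / total_failures if total_failures else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def predict_failure(loglik_failure, loglik_nonfailure, threshold=0.0):
    """Raise an alarm if the failure model explains the incoming sequence
    sufficiently better than the non-failure model."""
    return (loglik_failure - loglik_nonfailure) > threshold

def max_f_measure(scores, failure_followed):
    """Sweep the decision threshold over all observed score differences.
    scores: list of (loglik_failure, loglik_nonfailure) per incoming sequence;
    failure_followed: True where a failure actually occurred afterwards."""
    diffs = [lf - ln for lf, ln in scores]
    total_failures = sum(failure_followed)
    best = 0.0
    for thr in sorted(set(diffs)) + [min(diffs) - 1.0]:
        alarms = [d > thr for d in diffs]
        correct = sum(a and y for a, y in zip(alarms, failure_followed))
        best = max(best, f_measure(correct, sum(alarms), total_failures))
    return best
```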
Conclusions
■ We have presented the data preprocessing techniques that we applied for online failure prediction in a commercial telecommunication system
■ The presented techniques include:
  ● Assignment of IDs to error messages using Levenshtein's edit distance
  ● Failure sequence clustering
  ● Noise filtering based on a statistical test
■ Using error and failure logs of the commercial telecommunication system, we showed that elaborate data preprocessing is an essential step toward accurate failure prediction
Backup
Tupling
■ Goal: remove multiple reporting of the same issue
■ Approach: combine messages of the same type if they occur closer in time to each other than a threshold ε
■ Problem: determining the threshold value ε
  ● Solution suggested by Tsao and Siewiorek: observe the number of tuples for various values of ε and apply the "elbow rule" (see the sketch below)
  [Figure: number of tuples plotted over ε]
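A minimal sketch of tupling and of the curve used for the elbow rule, under our own assumptions: events are (timestamp in seconds, error ID) pairs sorted by time, and a tuple is extended as long as the gap to the previous report of the same error type stays below ε. All names are ours.

```python
def tuple_messages(events, epsilon):
    """Collapse repeated reports of the same error type that are less than
    `epsilon` seconds apart into a single tuple.
    events: list of (timestamp_seconds, error_id), sorted by time."""
    last_seen = {}   # error_id -> timestamp of the most recent report
    tuples = []
    for ts, eid in events:
        if eid in last_seen and ts - last_seen[eid] < epsilon:
            last_seen[eid] = ts          # extend the current tuple
        else:
            tuples.append((ts, eid))     # start a new tuple
            last_seen[eid] = ts
    return tuples

def tuple_counts(events, epsilons):
    """Number of tuples for various epsilon values; plotting this curve and
    looking for the 'elbow' suggests a threshold."""
    return [(eps, len(tuple_messages(events, eps))) for eps in epsilons]
```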
HSMM Model Structure for Failure Sequence Clustering
[Figure: small HSMM topology with states s1–s5 and a failure state F]
Cluster Distance Metrics
[Figure: illustrations of single linkage, complete linkage, and average linkage]
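For reference, the three linkage criteria shown in the figure can be written down directly; this minimal sketch assumes clusters are lists of items and d is any pairwise distance function (names ours).

```python
def single_linkage(A, B, d):
    """Distance between the closest pair of points across clusters A and B."""
    return min(d(a, b) for a in A for b in B)

def complete_linkage(A, B, d):
    """Distance between the farthest pair of points across clusters A and B."""
    return max(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    """Mean pairwise distance between clusters A and B."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))
```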
Online Failure Prediction
[Figure: runtime pipeline — incoming error messages pass through error-ID assignment, tupling, and sequence extraction; the resulting error message sequence is filtered per failure mechanism (Filtering 1…u), scored by Models 0…u to obtain sequence likelihoods 0…u, and classified to produce the failure prediction]
Comparison of Techniques
[Figure: bar chart comparing periodic prediction, DFT, Eventset, and SVD-SVM with the HSMM approach in terms of precision, recall, F-measure, and false positive rate; y-axis from 0.0 to 0.7]
Hidden Semi-Markov Model
[Figure: chain of hidden states 1 … N−1 leading to a failure state F, with time-dependent transition distributions g_ij(t) and per-state symbol emission probabilities b_i(A), b_i(B), b_i(C)]
■ Discrete-time Markov chain (DTMC)
  ● States (1, ..., N−1, F)
  ● Transition probabilities
■ Hidden Markov model (HMM)
  ● Each state can generate (error) symbols (A, B, C, F)
  ● Discrete probability distribution of symbols per state: b_i(X)
■ Hidden semi-Markov model (HSMM)
  ● Time-dependent transition probabilities g_ij(t)
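To make the three ingredients concrete, here is a minimal sketch of an HSMM-style sequence likelihood: a forward pass over (symbol, delay) pairs in which the transition probability depends on the delay via g_ij(t). This is an illustration under our own assumptions, not the authors' implementation; class and variable names are ours.

```python
import math

class HSMM:
    """Minimal hidden semi-Markov model over error symbols (names are ours).
    g[i][j] is a callable returning the probability of moving from state i to
    state j given the delay t since the last event; b[i] maps a symbol to its
    emission probability in state i; pi is the initial state distribution."""

    def __init__(self, pi, g, b):
        self.pi, self.g, self.b = pi, g, b

    def log_likelihood(self, observations):
        """Forward pass over (symbol, delay_since_previous_event) pairs,
        illustrating time-dependent transition probabilities g_ij(t)."""
        n = len(self.pi)
        first_symbol, _ = observations[0]
        alpha = [self.pi[i] * self.b[i].get(first_symbol, 0.0) for i in range(n)]
        for symbol, delay in observations[1:]:
            alpha = [
                sum(alpha[i] * self.g[i][j](delay) for i in range(n))
                * self.b[j].get(symbol, 0.0)
                for j in range(n)
            ]
        total = sum(alpha)
        return math.log(total) if total > 0 else float("-inf")
```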
Proactive Fault Management
[Figure: overview of proactive fault management — measurements from the running system feed the online failure prediction model; its output is used for failure avoidance and preparation for failure]