Automatic Failure Diagnosis based on Timing Behavior Anomaly Detection in Distributed Java Web Applications
Diploma Thesis Presentation
Nina S. Marwede
Abteilung Software Engineering, Fakultät II – Department für Informatik
August 26, 2008
First examiner: Prof. Dr. Wilhelm Hasselbring
Second examiner: MIT Matthias Rohr
Advisor: Dipl.-Inform. André van Hoorn
Advisor: MIT Matthias Rohr
Contents
1 Motivation
2 Foundations
3 Goals
4 Approach
5 Case Study
6 Conclusions
Nina Marwede (Univ. of Oldenburg), Failure Diagnosis based on Timing Behavior, Aug 26, 2008
Motivation / Motivation for Automatic Failure Diagnosis
◮ Software systems are practically never free of faults
◮ Software failures have great influence on our lives
◮ Large effort for manual diagnosis and debugging
◮ Automated processes are required
  1 Failure detection
  2 Fault localization
  3 Fault removal
Foundations / Monitoring of System Behavior
◮ Log files
◮ User interfaces
◮ Resources
◮ Control flow
◮ Timing behavior
→ Instrumentation of hardware/software
Kieker [Rohr et al., 2008]
[Figure: Kieker component diagram. Components: software system with monitoring instrumentation (probes marked M), :Tpmon, :TpmonControl, database, :Tpan, and the analysis components :SequenceAnalysis (sequence diagrams), :DependencyAnalysis (dependency graphs), :TimingAnalysis (timing diagrams), :ExecutionModelAnalysis (Markov chains).]
Foundations / Failure Diagnosis
◮ Model checking: explicit messages
◮ Timing behavior: throughput, latency, response times
◮ Anomaly detection: statistical analysis
◮ Correlation: connection of information from different sources
◮ Goal: cause instead of symptoms
◮ Visualization
Goals
1 Design of an approach for fault localization
  ◮ Timing behavior anomaly detection [Rohr, 2008]
  ◮ Calling dependencies between components (dependency graphs)
  ◮ Focus on event correlation ⇒ “Anomaly Correlator”
2 Evaluation: Case Study
  ◮ Java Web Application: iBATIS JPetStore
  ◮ Workload Generation: Markov4JMeter [van Hoorn et al., 2008]
  ◮ Fault Injection
Approach / Solution Idea
◮ Correlation: Draw conclusions from the arrangement of the anomalies in the calling dependency graph
[Figure: calling dependency graph with nodes A to G, each marked as normal, unsure, or anomalous.]
Approach / Implementation
◮ Extension to existing software “Kieker” [Rohr et al., 2008]
◮ Tpmon stores monitoring data, Tpan with its plug-ins analyzes it
◮ Correlator: Plug-in for Tpan
[Figure: the Correlator inside Tpan with four numbered stages. Labels in the diagram: Anomaly Detector, Model Building, Execution, Aggregation, Cause Estimation, Visualization, Textual Output, Anomaly Graphs.]
Approach / Assumptions
◮ Correct failure detection
◮ Correct anomaly scoring
◮ Failure has distinct cause
◮ Exactly one failure in the observation period
◮ Anomaly propagation
Approach / Input Data
1 Calling dependencies between operations
  [Figure: calling dependency graph with nodes A, B, C.]
2 Anomalies in the timing behavior of executions
  Comp  VM  Start  RT  Anomaly
  ...
  A     X   0001   8    0.6
  C     Y   0002   1   −0.2
  B     X   0004   4    0.9
  C     Y   0006   2    0.3
  ...
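As an illustration only, the two inputs can be pictured as plain data structures. The class name `Execution`, its field names, and the edge list below are assumptions made for this sketch, not Kieker's record format; the four rows are the ones listed on the slide, while the edges between A, B, and C are hypothetical because the figure's arrows are not recoverable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    """One monitored execution (field names are assumptions)."""
    component: str       # "Comp" column
    vm: str              # "VM" column
    start: int           # "Start" column
    response_time: int   # "RT" column
    anomaly: float       # "Anomaly" column


# The four rows shown on the slide.
ROWS = """\
A X 0001 8 0.6
C Y 0002 1 -0.2
B X 0004 4 0.9
C Y 0006 2 0.3
"""


def parse_rows(text):
    """Turn whitespace-separated rows into Execution records."""
    executions = []
    for line in text.splitlines():
        comp, vm, start, rt, anomaly = line.split()
        executions.append(
            Execution(comp, vm, int(start), int(rt), float(anomaly)))
    return executions


executions = parse_rows(ROWS)

# Hypothetical calling dependencies as (caller, callee) pairs.
calls = [("A", "B"), ("A", "C")]

# Anomaly scores grouped per component, ready for aggregation.
by_component = {}
for e in executions:
    by_component.setdefault(e.component, []).append(e.anomaly)
```

Grouping the scores per component is the starting point for the aggregation and correlation steps that follow.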
Approach / Step 1: Preparation of Data Structures
◮ Generation of calling dependency graphs from traces
◮ Connection of anomalies with software architecture
[Figure: calling dependency graph of the monitored operations: $, doGet(HttpServletRequest,HttpServletResponse), doPost(HttpServletRequest,HttpServletResponse), viewItem(), addItemToCart(), viewCategory(), newOrder(), signon(), getItem(String), getCategory(String), insertOrder(Order), getProductListByCategory(String).]
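Generating a calling dependency graph from traces could look roughly like the sketch below. It assumes a trace is an ordered list of (operation, stack depth) pairs, which is a simplification chosen here and not necessarily the format Tpmon stores. The `$` node is assumed to stand for the external caller, operation names are shortened, and the two sample traces are hypothetical.

```python
from collections import Counter


def build_cdg(traces):
    """Return {(caller, callee): call count} over all traces."""
    edges = Counter()
    for trace in traces:
        stack = []  # operations currently on the call stack
        for operation, depth in trace:
            # Drop everything at or below this depth so that the
            # top of the stack is the caller of this execution.
            del stack[depth:]
            caller = stack[-1] if stack else "$"
            edges[(caller, operation)] += 1
            stack.append(operation)
    return edges


# Hypothetical traces; depth 0 marks the entry operation.
traces = [
    [("doGet", 0), ("viewItem", 1), ("getItem", 2)],
    [("doGet", 0), ("viewCategory", 1), ("getCategory", 2),
     ("getProductListByCategory", 2)],
]

cdg = build_cdg(traces)
```

The edge counts double as call frequencies, which a correlation step can later use as weights.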
Approach / Challenges (1/2)
◮ Aggregation: How to aggregate a number of anomaly scores into one value?
◮ Four places are involved: Three architectural levels, and neighbors on operation level
◮ Five methods are evaluated: Median, power mean (three exponents), maximum
[Figure: histogram of the number of executions over the anomaly score.]
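The aggregation methods named here can be sketched as follows. Two assumptions are made that the slides do not state: anomaly scores lie in [-1, 1], and they are shifted to [0, 1] before the power mean is taken, because powers of negative numbers are not well behaved. The exponents 1, 2, and 3 are placeholders for the three exponents mentioned.

```python
import statistics


def power_mean(values, p):
    """Power (generalized) mean of non-negative values, p > 0."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def to_unit(score):
    """Map a score from [-1, 1] to [0, 1] (assumed score range)."""
    return (score + 1.0) / 2.0


def from_unit(u):
    """Map a value from [0, 1] back to [-1, 1]."""
    return 2.0 * u - 1.0


def aggregate(scores, method, p=2):
    """Collapse several anomaly scores into one value."""
    if method == "median":
        return statistics.median(scores)
    if method == "maximum":
        return max(scores)
    if method == "power_mean":
        shifted = [to_unit(s) for s in scores]
        return from_unit(power_mean(shifted, p))
    raise ValueError(f"unknown method: {method}")


scores = [0.6, -0.2, 0.9, 0.3]
median = aggregate(scores, "median")
maximum = aggregate(scores, "maximum")
pm1 = aggregate(scores, "power_mean", p=1)  # equals the arithmetic mean
pm2 = aggregate(scores, "power_mean", p=2)
pm3 = aggregate(scores, "power_mean", p=3)
```

Larger exponents pull the result toward the maximum, so the exponent controls how strongly a few very anomalous executions dominate many normal ones.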
Approach / Challenges (2/2)
◮ Correlation: How to recognize the propagation of an anomaly?
◮ Consider the perspective of each component
◮ Three algorithms are evaluated: Trivial, simple, advanced
[Figure: calling dependency graph with nodes A to G.]
Approach / Step 2: Processing of Anomaly Scores
Three algorithms
1 Trivial: Simple aggregation, no correlation
2 Simple: Simple aggregation, “pessimistic” correlation
3 Advanced: Weighted configurable aggregation, “optimistic” correlation
Approach / Step 2: Processing / Trivial Algorithm
◮ Aggregation: Unweighted arithmetic mean on each level
◮ Correlation: None
[Figure: aggregation hierarchy with the levels Application, Deployment Context, Component, Operation, Execution.]
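A minimal sketch of this level-by-level averaging is given below. The tuple layout of the input and all sample values are hypothetical; only the idea of taking an unweighted arithmetic mean at each level comes from the slide.

```python
from collections import defaultdict
from statistics import mean


def mean_by(pairs):
    """Group (key, score) pairs by key and average each group."""
    groups = defaultdict(list)
    for key, score in pairs:
        groups[key].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


def trivial_ratings(executions):
    """Average executions into operations, components, contexts.

    executions: (deployment_context, component, operation, score).
    """
    operations = mean_by(
        ((ctx, comp, op), s) for ctx, comp, op, s in executions)
    components = mean_by(
        ((ctx, comp), r) for (ctx, comp, _op), r in operations.items())
    contexts = mean_by(
        (ctx, r) for (ctx, _comp), r in components.items())
    return operations, components, contexts


# Hypothetical executions.
executions = [
    ("X", "A", "a1", 0.6),
    ("X", "A", "a1", 0.2),
    ("X", "A", "a2", -0.4),
    ("X", "B", "b1", 0.9),
    ("Y", "C", "c1", -0.2),
    ("Y", "C", "c1", 0.3),
]

operations, components, contexts = trivial_ratings(executions)
```

Because each level averages the ratings of the level below without weights, component A is rated as the mean of its two operation ratings (about 0.0) rather than the mean of its three executions (about 0.13).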
Approach / Step 2: Processing / Simple Algorithm
1 Rule 1: Mean of anomaly ratings of directly connected callers relatively high? ⇒ Increase rating
2 Rule 2: Maximum of anomaly ratings of directly connected callees relatively high? ⇒ Decrease rating
[Figure: example calling dependency graph with anomaly ratings A 0.5, B 0.2, C −0.6, D 1.0, E −0.7, F −0.2, G −0.6.]
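The two rules can be sketched as follows. The ratings are the ones read off the slide's example figure, but the edges are hypothetical since the arrows are not recoverable. The slide also does not define "relatively high" or the size of the adjustment, so the fixed `threshold` and `step` parameters are assumptions made for this sketch.

```python
RATINGS = {"A": 0.5, "B": 0.2, "C": -0.6, "D": 1.0,
           "E": -0.7, "F": -0.2, "G": -0.6}

# Hypothetical (caller, callee) edges.
EDGES = [("A", "C"), ("A", "D"), ("B", "D"),
         ("B", "E"), ("D", "F"), ("D", "G")]


def simple_correlation(ratings, edges, threshold=0.0, step=0.25):
    """Adjust each rating using its direct callers and callees."""
    callers = {node: [] for node in ratings}
    callees = {node: [] for node in ratings}
    for caller, callee in edges:
        callees[caller].append(callee)
        callers[callee].append(caller)

    adjusted = {}
    for node, rating in ratings.items():
        value = rating
        # Rule 1: anomalous callers suggest the anomaly propagated
        # upward from this node, so its rating is increased.
        if callers[node]:
            caller_mean = (sum(ratings[c] for c in callers[node])
                           / len(callers[node]))
            if caller_mean > threshold:
                value += step
        # Rule 2: an anomalous callee suggests the cause lies further
        # down the call chain, so this rating is decreased.
        if callees[node]:
            if max(ratings[c] for c in callees[node]) > threshold:
                value -= step
        adjusted[node] = max(-1.0, min(1.0, value))  # keep in [-1, 1]
    return adjusted


adjusted = simple_correlation(RATINGS, EDGES)
most_suspicious = max(adjusted, key=adjusted.get)
```

With these assumed edges, D keeps the highest rating while its callers A and B are lowered, which matches the intent of pointing at the cause instead of the symptoms.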
Approach / Step 2: Processing / Advanced Algorithm
Aggregation
  ◮ In addition to arithmetic mean: median, power mean, maximum
Correlation
  ◮ Consideration of call frequencies (edges in CDG)
  ◮ Transitive closure of callers
  ◮ Transitive closure of callees
[Figure: calling dependency graph with nodes A to M and edges labeled with call frequencies (958, 4312, 4612, 3256, 231, 564).]
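The slide does not say how these ingredients are combined, so the sketch below only shows two building blocks: the transitive closure of callers or callees, and a mean weighted by call frequency. The assignment of the frequencies to particular edges, the edges themselves, and the two ratings are hypothetical.

```python
def transitive_closure(start, neighbors):
    """All nodes reachable from start via the neighbors mapping.

    start itself is excluded unless it lies on a cycle.
    """
    seen = set()
    stack = list(neighbors.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors.get(node, ()))
    return seen


def weighted_mean(ratings, weights):
    """Mean of ratings weighted by call frequency."""
    total = sum(weights.values())
    return sum(ratings[n] * w for n, w in weights.items()) / total


# Hypothetical edges (caller, callee) with call frequencies.
CALLS = {
    ("H", "A"): 958, ("I", "A"): 4312, ("J", "B"): 4612,
    ("A", "C"): 3256, ("A", "D"): 231, ("B", "D"): 564,
}
RATINGS = {"A": 0.5, "B": 0.2}  # hypothetical anomaly ratings

callees, callers = {}, {}
for (caller, callee), count in CALLS.items():
    callees.setdefault(caller, []).append(callee)
    callers.setdefault(callee, []).append(caller)

callers_of_d = transitive_closure("D", callers)
callees_of_h = transitive_closure("H", callees)

# Rating of D's direct callers, weighted by how often each calls D.
direct = {caller: count
          for (caller, callee), count in CALLS.items() if callee == "D"}
caller_rating_d = weighted_mean(RATINGS, direct)
```

Weighting by call frequency lets a caller that invokes D often influence the result more than one that calls it rarely.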