Timing Behavior Anomaly Detection for Automatic Failure Detection and Diagnosis

  1. Timing Behavior Anomaly Detection for Automatic Failure Detection and Diagnosis. Research visit at Charles University, Prague. Matthias Rohr (matthias.rohr@informatik.uni-oldenburg.de), Graduate School TrustSoft, Software Engineering Group, Department of Computing Science, University of Oldenburg. 10th of April 2007. Matthias Rohr (TrustSoft) Timing Behavior Anomaly Detection 10th of April 2007 1 / 32

  5. Motivation. Failure diagnosis in business-critical software systems: manual failure diagnosis is time-consuming and error-prone; runtime behavior observations provide indications for failure diagnosis. [Figure labels: Administrators, Users, Complex Software System with Monitoring (measurement points M), Measurements, Log of Runtime Behavior, Failure Diagnosis, Diagnosis Report]

  6. Motivation. Vision: automatic localization of faults through runtime behavior evaluation. [Figure: components :Catalog, :Bookshop, and :CRM, annotated with P_ft values (0.8, 0.08, 0.12)]

  8. Motivation: Approach. Automatic localization of faults through runtime behavior evaluation; automatic detection of timing behavior anomalies in software systems. Research questions: (1) How can anomalies be detected in timing behavior? (2) How can system usage variations be addressed in timing behavior evaluation? (3) What is the relation between software faults and runtime timing behavior?

  9. Outline: 1 Foundations (Dependability, Anomaly Detection, Software Performance); 2 Creation of the timing behavior profile; 3 Fault Localization; 4 Evaluation; 5 Related work; 6 Conclusions.

  11. Foundations: Dependability. Dependability terminology [Avižienis et al., 2004], threats to dependability: Fault: root cause of a failure. Error: incorrect system state. Failure: deviation from correct system behavior visible to the user. Failure diagnosis: failure detection, identification of faults, fault localization.

  13. Foundations: Dependability. Availability, common definition (e.g., [Musa et al., 1987]): Availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. Two alternative strategies to increase availability: increase the mean time to failure (reliability), or decrease the mean time to repair, e.g., through failure diagnosis support.
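
The availability definition on this slide can be checked with a small worked example. The sketch below is illustrative only: the function name and the MTTF/MTTR values in hours are assumptions, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), as defined on the slide."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers chosen only to illustrate the two strategies.
baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)
longer_mttf = availability(mttf_hours=2000.0, mttr_hours=10.0)   # more reliable
shorter_mttr = availability(mttf_hours=1000.0, mttr_hours=5.0)   # faster repair

print(f"baseline:     {baseline:.4f}")
print(f"longer MTTF:  {longer_mttf:.4f}")
print(f"shorter MTTR: {shorter_mttr:.4f}")
```

With these assumed values, doubling MTTF and halving MTTR yield the same availability, which is why faster failure diagnosis is a viable alternative to improving reliability.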

  15. Foundations: Anomaly Detection (1/2). An anomaly is a deviation from “normal” system behavior. Normal system behavior can be described by static reference values (e.g., mean response time over a day ≤ T) or by analytical or statistical models that depend on system influences and historical system behavior. [Figure labels: System influences, System, System behavior, Anomaly detection, Anomaly analysis]
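
The static reference value mentioned on this slide (mean response time over a day ≤ T) can be sketched as a simple check. The function name, the sample response times, and the threshold of 200 ms are assumptions made for illustration.

```python
from statistics import mean


def exceeds_static_reference(response_times_ms, threshold_ms):
    """Return True if the mean response time of the window exceeds T."""
    if not response_times_ms:
        raise ValueError("need at least one observation")
    return mean(response_times_ms) > threshold_ms


# Hypothetical one-day samples (milliseconds) and reference value T.
T_MS = 200.0
day_ok = [120.0, 135.0, 110.0, 140.0]
day_slow = [120.0, 135.0, 900.0, 1400.0]

ok_flag = exceeds_static_reference(day_ok, T_MS)
slow_flag = exceeds_static_reference(day_slow, T_MS)
print(ok_flag, slow_flag)
```

A fixed threshold like this ignores system influences such as workload, which is the limitation that motivates the model-based alternative named on the slide.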

  18. Foundations: Anomaly Detection (2/2). Methods to create normal behavior profiles: manual specification; automatic profile learning from observations. Challenges of anomaly detection: false alarms; system usage; nonlinear system behavior and modeling uncertainties. Typical application domains: industrial manufacturing and large-scale control systems [Palade et al., 2006]; network management [Maxion, 1990]; intrusion detection (security) [Denning, 1987].

  19. Foundations: Software Performance. Influences on software timing behavior: system architecture (hardware resource capacity, software design); system usage [cp. Sabetta and Koziolek, 2007], i.e., workload intensity (e.g., number of active users) and service demand characteristics (e.g., individual request parameters); system state; performance tuning (e.g., caching, load balancing), ...; server virtualization.

  20. Outline: 1 Foundations; 2 Creation of the timing behavior profile (Instrumentation, Monitoring, Analysis of Execution Sequences, Analysis of Workload Intensity); 3 Fault Localization; 4 Evaluation; 5 Related work; 6 Conclusions.

  23. Creation of the timing behavior profile: Failure diagnosis through online timing behavior evaluation. Timing behavior anomalies are deviations from the normal timing behavior (here: response times) of operations of a software system, e.g., exceptionally high or low response times. Relation between software faults and timing behavior anomalies: software faults tend to cause timing behavior anomalies [Kao et al., 1993]; fault localization based on timing behavior anomalies has been successful [Agarwal et al., 2004]; response times in enterprise resource planning (ERP) systems are often log-normally distributed [Mielke, 2006].
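
One way to use the log-normal observation is to learn a profile on the logarithm of the response times and flag values far from its mean in either direction. This is a minimal sketch under stated assumptions, not the detection method of the talk: the z-score rule, the limit of 3, the function names, and the sample data are all hypothetical.

```python
import math
from statistics import mean, stdev


def learn_profile(response_times_ms):
    """Estimate mean and standard deviation of log(response time)."""
    logs = [math.log(t) for t in response_times_ms]
    return mean(logs), stdev(logs)


def is_anomalous(response_time_ms, profile, z_limit=3.0):
    """Flag exceptionally high or low response times (assumed z-score rule)."""
    mu, sigma = profile
    if sigma == 0:
        # Degenerate profile: any deviation from the single known value is flagged.
        return math.log(response_time_ms) != mu
    z = (math.log(response_time_ms) - mu) / sigma
    return abs(z) > z_limit


# Hypothetical training observations for one operation (milliseconds).
training = [100.0, 110.0, 95.0, 105.0, 120.0, 90.0, 102.0, 98.0]
profile = learn_profile(training)

normal_flag = is_anomalous(105.0, profile)
high_flag = is_anomalous(1000.0, profile)
low_flag = is_anomalous(5.0, profile)
print(normal_flag, high_flag, low_flag)
```

Working in log space keeps the check symmetric for very slow and very fast responses, matching the slide's point that both exceptionally high and exceptionally low response times count as anomalies.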

  24. Creation of the timing behavior profile: Overview. Timing behavior anomaly detection for failure diagnosis. [Figure: Initial activities: instrumentation for monitoring, monitoring, creation of the timing behavior profile. Continuous activities: monitoring, update of the timing behavior profile. Activities during failure diagnosis: anomaly detection and anomaly analysis, using the log (response times, execution sequences) and the timing behavior profile, producing a diagnosis report.]
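
The continuous "update of the timing behavior profile" activity could be realized with an incremental statistic per operation. The sketch below uses Welford's online algorithm for mean and variance; this choice, the class and function names, and the operation names are assumptions for illustration and are not taken from the talk.

```python
import math
from collections import defaultdict


class RunningStats:
    """Incrementally maintained mean and sample standard deviation (Welford)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


# Timing behavior profile: one RunningStats per monitored operation.
profile = defaultdict(RunningStats)


def record(operation, response_time_ms):
    """Update the profile with one monitored response time."""
    profile[operation].add(response_time_ms)


# Hypothetical monitoring log entries (operation name, response time in ms).
for op, rt in [("Catalog.search", 100.0), ("Catalog.search", 120.0),
               ("Catalog.search", 110.0), ("CRM.lookup", 40.0)]:
    record(op, rt)

print(profile["Catalog.search"].mean, profile["Catalog.search"].stddev)
```

An online update avoids storing the full monitoring log, at the cost of treating old and new observations equally; whether that fits a given system depends on how quickly its normal behavior drifts.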
