
Automatic Failure Diagnosis Support in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation



  1. Automatic Failure Diagnosis Support in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation
     Presentation at 13th European Conference on Software Maintenance and Reengineering
     Nina Marwede (1), Matthias Rohr (1), André van Hoorn (2), Wilhelm Hasselbring (3)
     (1) BTC Business Technology Consulting AG, Germany
     (2) Graduate School TrustSoft, University of Oldenburg, Germany
     (3) Software Engineering Group, University of Kiel, Germany
     Contact: matthias.rohr@btc-ag.com
     March 25, 2009
     Slide footer: Matthias Rohr (BTC AG), Failure Diagnosis based on Timing Behavior, 25.03.2009, CSMR Kaiserslautern, 1 / 25


  5. Motivation
     [Figure: administrators and users interacting with a complex software system]
     Complex software systems are almost never free of faults.
     Software faults are a major cause of system failures [Küng and Krause, 2007; Gray, 1986].
     Manual failure diagnosis is time-consuming and error-prone:
       Huge number of program states (space and time) [Cleve and Zeller, 2005]
       Temporal and spatial chasms between cause and symptom [Eisenstadt, 1997]
       Many systems are not completely known by any single person
       Some failures are hard to reproduce, e.g., Heisenbugs

  6. Motivation
     Manual failure diagnosis is time-consuming and error-prone.
     Most common failure diagnosis methods [Eisenstadt, 1997]:
       Data gathering (e.g., adding print statements to source code, memory dumps)
       Interactive execution using debugging tools

  7. Motivation: Strategy to support failure diagnosis
     Runtime behavior is indicative of failures and error propagation.
     Automatic fault localization using anomaly detection on monitoring data.
     Analysis and visualization in the context of automatically derived architecture models.

  8. Outline
     1 Motivation
     2 Foundations
     3 Approach
     4 Case Study
     5 Summary & Conclusions


  10. Foundations: Online failure diagnosis based on anomaly detection
      Anomalies are deviations from normal system behavior.
      [Figure: the system influences system behavior, which is observed by anomaly detection]
      Fault localization activities:
        Anomaly detection
        Anomaly correlation (often plain aggregation)
        Visualization and/or reporting
      [Figure: several components, each with its own anomaly detection]


  12. Foundations: Propagation and Anomaly Detection
      Error propagation
        [Figure: fault, error (dormant / active), and failure of a system service; system with components :A and :B]
        Many errors propagate along calling dependencies.
      Anomaly correlation
        Anomalies propagate as well, so a compensating analysis is required.
        Some approaches analyze anomalies in the context of calling dependency graphs.

  13. Foundations: Dependency Graphs
      Calling dependency graphs
        Nodes: e.g., operations, components, deployment contexts, virtual machines
        Directed edges represent call actions
        Weights quantify call frequencies
      [Figure: example graph with entry node $, ActionServlet, CatalogBean, and CartBean; edge weights 250, 210, and 113]
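
The graph structure described on this slide can be sketched as a weighted directed graph. This is only an illustration: the class `CallingDependencyGraph` and its methods are hypothetical names, and the assignment of the weights 250, 210, and 113 to specific edges is an assumption read off the flattened figure text.

```python
from collections import defaultdict


class CallingDependencyGraph:
    """Directed graph whose edge weights count observed calls (hypothetical helper)."""

    def __init__(self):
        # caller -> callee -> number of calls
        self._calls = defaultdict(lambda: defaultdict(int))

    def add_call(self, caller, callee, count=1):
        """Record `count` calls from `caller` to `callee`."""
        self._calls[caller][callee] += count

    def callees(self, node):
        """Return {callee: call count} for all nodes called by `node`."""
        return dict(self._calls.get(node, {}))

    def callers(self, node):
        """Return {caller: call count} for all nodes that call `node`."""
        return {
            caller: targets[node]
            for caller, targets in self._calls.items()
            if node in targets
        }


# Assumed edge assignment for the example figure; "$" is the entry node.
cdg = CallingDependencyGraph()
cdg.add_call("$", "ActionServlet", 250)
cdg.add_call("ActionServlet", "CatalogBean", 210)
cdg.add_call("ActionServlet", "CartBean", 113)
```

Storing both directions of lookup is useful here because the later correlation step inspects an element's callers as well as its callees.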

  14. Contents (next section: 3 Approach)
      1 Motivation
      2 Foundations
      3 Approach
      4 Case Study
      5 Summary & Conclusions

  15. Approach: Overview

  16. Approach: Input Data
      1. Calling dependencies between operations
         [Figure: operations A, B, and C]
      2. Anomaly scores provided by a timing behavior anomaly detector

         Comp  VM  Start  RT  Anomaly
         ...
         A     X   0001   8    0.6
         C     Y   0002   1   −0.2
         B     X   0004   4    0.9
         C     Y   0006   2    0.3
         ...
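
As a rough sketch, the second input can be modeled as a list of per-execution records. The type `ExecutionRecord`, its field names, and the function `scores_by_component` are hypothetical; the slide gives no units for the Start and RT columns, so they are kept as plain integers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRecord:
    """One monitored execution with its anomaly score (field names are assumed)."""
    component: str
    vm: str
    start: int           # start timestamp, unit not stated on the slide
    response_time: int   # RT column, unit not stated on the slide
    anomaly_score: float


# The four rows shown in the slide's table.
records = [
    ExecutionRecord("A", "X", 1, 8, 0.6),
    ExecutionRecord("C", "Y", 2, 1, -0.2),
    ExecutionRecord("B", "X", 4, 4, 0.9),
    ExecutionRecord("C", "Y", 6, 2, 0.3),
]


def scores_by_component(recs):
    """Group anomaly scores by component, preserving record order."""
    grouped = defaultdict(list)
    for rec in recs:
        grouped[rec.component].append(rec.anomaly_score)
    return dict(grouped)
```

Grouping by component is one possible granularity; the same function could group by VM or operation if the records carried those keys.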

  17. Approach: Architectural model creation
      [Figure: Calling Dependency Graph (class granularity) for iBatis JPetStore; nodes are classes such as ActionServlet, RequestProcessor, CatalogBean, CartBean, and OrderService, and edges are labeled with call counts]
      Two alternative methods for creating the CDG:
        Analysis of monitoring data
        Static (source code) analysis
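
The first method, deriving the CDG from monitoring data, might look like the following minimal sketch. It assumes that monitoring yields (caller operation, callee operation) pairs with names of the form `Class.method`; that record format, the method names in the sample events, and the functions `class_of` and `build_cdg` are all hypothetical.

```python
from collections import Counter


def class_of(operation):
    """Strip the method name, e.g. 'CatalogBean.viewItem' -> 'CatalogBean'."""
    return operation.rsplit(".", 1)[0]


def build_cdg(call_events):
    """Count calls between classes.

    Returns a Counter mapping (caller_class, callee_class) to the number
    of observed calls, i.e. a CDG at class granularity.
    """
    edges = Counter()
    for caller_op, callee_op in call_events:
        edges[(class_of(caller_op), class_of(callee_op))] += 1
    return edges


# Made-up sample events; only the class names appear in the slide's figure.
events = [
    ("ActionServlet.doGet", "RequestProcessor.process"),
    ("ActionServlet.doPost", "RequestProcessor.process"),
    ("RequestProcessor.process", "CatalogBean.viewItem"),
]
edges = build_cdg(events)
```

Changing `class_of` to return the full operation name, or a deployment context, would give the other node granularities mentioned earlier in the talk.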


  19. Approach: Aggregation and integration into the architectural model
      Each architectural element’s anomaly scores are aggregated into a single value.
      Several metrics were explored (mean, median, power mean, ...).
      [Figure: distribution of anomaly scores (x-axis: anomaly score, y-axis: number of executions), with the value 0.2 marked]
      The aggregation reduces the complexity of the correlation activity.
      Example result: three operations with assigned anomaly scores.
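
The three metrics named on this slide can be compared with a short sketch. The sample scores are made up, the exponent `p = 2` is an arbitrary choice, and restricting the power mean to non-negative inputs is an assumption; the slides do not say how negative anomaly scores are treated.

```python
from statistics import mean, median


def power_mean(values, p):
    """Generalized (power) mean: ((1/n) * sum(x**p)) ** (1/p).

    Restricted here to non-negative inputs and p != 0 (an assumption).
    """
    if not values:
        raise ValueError("values must not be empty")
    if p == 0:
        raise ValueError("p must be non-zero")
    if any(v < 0 for v in values):
        raise ValueError("this sketch only handles non-negative scores")
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Made-up anomaly scores of one architectural element.
scores = [0.1, 0.2, 0.9]

aggregates = {
    "mean": mean(scores),
    "median": median(scores),
    "power_mean_p2": power_mean(scores, 2),
}
```

For `p > 1` the power mean weights high scores more strongly than the arithmetic mean, so a few strongly anomalous executions are less likely to be averaged away.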

  20. Approach: Correlation of anomaly ratings
      Rules are applied that recompute an element’s anomaly score in the context of its callers and callees.
      The approach is similar to a cellular automaton.
      The rules encapsulate knowledge about error and anomaly propagation.
      Example scenario: Is A’s anomaly score just the result of a fault in B?
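
The slides do not state the actual rules, so the following is only a hypothetical illustration of the cellular-automaton idea: every element's score is recomputed in one synchronous step from its neighbors' old scores. The single rule used here (lower a caller's score when one of its callees is more anomalous), the function name `correlate`, and the `step` parameter are assumptions, not the authors' method.

```python
def correlate(scores, callees, step=0.1):
    """Perform one synchronous update of all anomaly scores.

    scores:  {element: aggregated anomaly score}
    callees: {element: iterable of elements it calls}
    Hypothetical rule: if some callee is more anomalous than the element
    itself, the element's anomaly is presumed to be propagated and its
    score is lowered by `step`; otherwise it is left unchanged.
    """
    updated = {}
    for node, score in scores.items():
        # Only old scores are read, as in a cellular-automaton step.
        callee_scores = [scores[c] for c in callees.get(node, ()) if c in scores]
        if callee_scores and max(callee_scores) > score:
            updated[node] = score - step
        else:
            updated[node] = score
    return updated


# Example scenario from the slide: A calls B, and B is more anomalous than A.
result = correlate({"A": 0.6, "B": 0.9}, {"A": ["B"]})
```

In this scenario A's score drops while B's stays the same, which moves B ahead of A as the more likely root cause.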

