The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems

  1. The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems • Derrick Kondo (1), Bahman Javadi (1), Alexandru Iosup (2), Dick Epema (2) • (1) INRIA, France • (2) TU Delft, The Netherlands

  4. Motivation • Push toward experimental computer science • Hard to evaluate and compare algorithms and models for fault-tolerance • Lack of public trace data sets • Lack of standard trace format • Lack of parsing and analytical tools • Failures in distributed systems have increasingly high negative impact and complex dynamics

  5. Failure Trace Archive (FTA) • Availability traces of distributed systems, differing in scale, volatility, and usage • Standard event-based format for failure traces • Scripts and tools for parsing and analyzing traces in svn repository

  6. Related Work • Resources compared on data sets, format, parsing tools, and analysis tools • Grid Observatory: emphasis on EGEE; ✗ ✗ ✗ • Computer Failure Repo.: 12 data sets (mainly clusters); ✗ ✗ ✗ • Repo. of Avail. Traces: 5 data sets (mainly P2P); ✓ ✓ ✗ ✗ • Desktop Grid Archive: 4 desktop grids; ✓ ✗ ✗ ✗ • FTA (1): 22 data sets; ✓ ✓ ✓ • (1) FTA includes data sets of the former three resources, in addition to providing several new data sets

  11. Enabled Studies • Comparing models/algorithms using the identical data sets • Evaluation of generality/specificity of model/algorithm across different types of systems • Evaluation of the generality of a system trace • Analysis of evolution of failures over time • And many more...

  12. Contributions • Description of FTA, trace format and analysis toolbox • High-level statistical characterization of failures in each data set • Show importance of public data sets and methods via characterization of ambiguous data sets

  13. Background Definitions • Failure: observed deviation from correct system state • Availability (unavailability) interval: continuous period that system is in correct state (incorrect state) • Error: system state (not externally visible) that leads to failure • Fault: root cause of an error
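
The interval definitions above can be made concrete with a small sketch. It assumes, hypothetically, that one node's trace is a time-ordered list of `(timestamp, in_correct_state)` observations. The name `split_intervals` and this input layout are illustrative only and are not part of the FTA format.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) of one continuous period


def split_intervals(
    observations: Sequence[Tuple[float, bool]], end_time: float
) -> Tuple[List[Interval], List[Interval]]:
    """Merge consecutive same-state observations into continuous intervals.

    Returns (availability_intervals, unavailability_intervals).
    The input layout is an assumption made for this sketch.
    """
    avail: List[Interval] = []
    unavail: List[Interval] = []
    if not observations:
        return avail, unavail
    start, state = observations[0]
    for timestamp, current in observations[1:]:
        if current != state:
            # The state changed: close the running interval at this timestamp.
            (avail if state else unavail).append((start, timestamp))
            start, state = timestamp, current
    # Close the last interval at the end of the observation window.
    (avail if state else unavail).append((start, end_time))
    return avail, unavail


obs = [(0.0, True), (5.0, True), (10.0, False), (12.0, True)]
availability, unavailability = split_intervals(obs, end_time=20.0)
```

Here the node is in a correct state from 0 to 10 and from 12 to 20, and in an incorrect state from 10 to 12.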

  29. FTA Schema • Resource (versus job or user) centric • Event-based • Associated metadata • Codes for different components, events, and errors • Balance between completeness and sparseness • Extensibility • Raw, tabbed, and relational database (MySQL) formats • Schema diagram: platform, node, node_perf, component, component_type codes, creator, event_trace, event_state, event_type codes, event_end reason codes

  30. Data Quality Assessment • Syntactic: standard format library that checks data types and number of fields (automated) • Semantic: time moves forward and is non-overlapping, state is valid (automated) • Visual: look at the distribution for outliers (manual)
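
The automated semantic checks listed above could look roughly like the following sketch. The `Event` record, the state labels in `VALID_STATES`, and the function `semantic_errors` are hypothetical stand-ins chosen for illustration; they are not the FTA library's actual fields or API. The sketch also assumes events arrive in trace order.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Event(NamedTuple):
    node_id: int
    start: float
    end: float
    state: str  # hypothetical label, not an FTA code


VALID_STATES = {"available", "unavailable"}  # assumed for this sketch


def semantic_errors(events: Sequence[Event]) -> List[Tuple[int, str]]:
    """Return (event index, problem) pairs for events listed in trace order."""
    errors: List[Tuple[int, str]] = []
    last_end: Dict[int, float] = {}  # latest end time seen per node
    for index, ev in enumerate(events):
        if ev.state not in VALID_STATES:
            errors.append((index, "invalid state"))
        if ev.end < ev.start:
            errors.append((index, "time moves backward"))
        previous_end = last_end.get(ev.node_id)
        if previous_end is not None and ev.start < previous_end:
            errors.append((index, "overlaps earlier event"))
        last_end[ev.node_id] = (
            ev.end if previous_end is None else max(ev.end, previous_end)
        )
    return errors


trace = [
    Event(1, 0.0, 10.0, "available"),
    Event(1, 10.0, 12.0, "unavailable"),
    Event(1, 11.0, 20.0, "available"),  # starts before the previous event ended
    Event(2, 5.0, 3.0, "available"),    # ends before it starts
    Event(2, 6.0, 8.0, "rebooting"),    # not one of the assumed states
]
problems = semantic_errors(trace)
```

The visual check for outliers stays manual, so it is not covered here.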

  31. Data Sets • Usage (P2P, supercomputer, grids, desktop PCs) • Type (CPU, network, I/O) • Scale (50 to 240,000 hosts) • Volatility (minutes to days) • Resolution (with respect to failure detection)

  32. Currently 21 Data Sets

  42. Statistical Analysis

  43. FTA Toolbox • Diagram: MySQL trace database; stages initialize, query, process, finalize; outputs text, html, wiki, latex • Makes it easy to run a set of statistical measures across all the data sets • Provides library of functions that can be reused and incorporated • Implemented in Matlab • svn checkout svn:// toolbox

  44. Failure Modelling • Approach • Model availability and unavailability intervals, each with a single probability distribution • Assume availability and unavailability intervals are independent and identically distributed • Descriptive, not prescriptive

  45. Distributions of Availability and Unavailability Intervals

  46. Distributions of Availability and Unavailability Intervals: Qualitative Description

  47. Model Fitting • For each candidate probability distribution • Compute parameters that maximize the distribution's likelihood • Measure goodness of fit using Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests • Compute each p-value using 30 samples; take the average of 1000 p-values
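
A minimal sketch of the fitting loop above, under several assumptions that are not from the slides: only one candidate distribution (exponential) is fitted, only the KS test is shown (AD is omitted), subsamples are drawn without replacement, and the p-value uses a textbook asymptotic approximation of the Kolmogorov distribution rather than whatever the Matlab toolbox uses. All function names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence


def fit_exponential_rate(durations: Sequence[float]) -> float:
    """Maximum-likelihood rate of an exponential distribution: 1 / mean."""
    return len(durations) / sum(durations)


def ks_statistic(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Largest gap between the empirical CDF of `sample` and the model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def ks_p_value(d: float, n: int) -> float:
    """Approximate KS p-value from the asymptotic Kolmogorov series."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def mean_subsample_p_value(
    durations: Sequence[float],
    cdf: Callable[[float], float],
    sample_size: int = 30,
    repeats: int = 1000,
    seed: int = 0,
) -> float:
    """Average the KS p-value over `repeats` random subsamples of `sample_size`."""
    rng = random.Random(seed)
    pool: List[float] = list(durations)
    total = 0.0
    for _ in range(repeats):
        subsample = rng.sample(pool, sample_size)  # without replacement (assumed)
        total += ks_p_value(ks_statistic(subsample, cdf), sample_size)
    return total / repeats


# Synthetic interval lengths, used only to exercise the sketch.
generator = random.Random(42)
durations = [generator.expovariate(0.1) for _ in range(2000)]
rate = fit_exponential_rate(durations)
mean_p = mean_subsample_p_value(durations, lambda x: 1.0 - math.exp(-rate * x))
```

Using small subsamples keeps the test from rejecting every candidate simply because the full data set is large. A higher average p-value indicates a better fit for that candidate.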

  48. P-Values for KS & AD Goodness-of-Fit Tests • Panels: Availability, Unavailability

