The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems


  1. The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems
     Derrick Kondo (INRIA, France), Bahman Javadi (INRIA, France), Alexandru Iosup (TU Delft, The Netherlands), Dick Epema (TU Delft, The Netherlands)

  4. Motivation
     • Push toward experimental computer science
     • Hard to evaluate and compare algorithms and models for fault-tolerance
     • Lack of public trace data sets
     • Lack of standard trace format
     • Lack of parsing and analytical tools
     • Failures in distributed systems have increasingly high negative impact and complex dynamics

  5. Failure Trace Archive (FTA): http://fta.inria.fr
     • Availability traces of distributed systems, differing in scale, volatility, and usage
     • Standard event-based format for failure traces
     • Scripts and tools for parsing and analyzing traces in an svn repository

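  6. Related Work
     • Grid Observatory: data sets with emphasis on EGEE; format ✗, parsing tools ✗, analysis tools ✗
     • Computer Failure Repo.: 12 data sets (mainly clusters); format ✗, parsing tools ✗, analysis tools ✗
     • Repo. of Avail. Traces: 5 data sets (mainly P2P); a mix of ✓ and ✗ across format, parsing tools, and analysis tools
     • Desktop Grid Archive: 4 data sets (desktop grids); format ✗, parsing tools ✗, analysis tools ✗
     • FTA: 22 data sets (1); format ✓, parsing tools ✓, analysis tools ✓
     (1) FTA includes data sets of the former three resources, in addition to providing several new data sets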

  11. Enabled Studies
      • Comparing models/algorithms using identical data sets
      • Evaluation of generality/specificity of a model/algorithm across different types of systems
      • Evaluation of the generality of a system trace
      • Analysis of the evolution of failures over time
      • And many more...

  12. Contributions
      • Description of FTA, trace format and analysis toolbox
      • High-level statistical characterization of failures in each data set
      • Show importance of public data sets and methods via characterization of ambiguous data sets

  13. Background Definitions
      • Failure: observed deviation from the correct system state
      • Availability (unavailability) interval: continuous period during which the system is in a correct (incorrect) state
      • Error: system state (not externally visible) that leads to a failure
      • Fault: root cause of an error
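The interval definitions above can be made concrete with a small sketch. It assumes a hypothetical input of time-ordered `(timestamp, available)` observations for a single node; that tuple layout and the name `to_intervals` are illustrative assumptions, not the FTA format or toolbox.

```python
from typing import List, Tuple

Observation = Tuple[float, bool]      # (timestamp, node observed in a correct state?)
Interval = Tuple[float, float, bool]  # (start, end, available?)


def to_intervals(observations: List[Observation], trace_end: float) -> List[Interval]:
    """Merge time-ordered observations into maximal same-state intervals."""
    intervals: List[Interval] = []
    if not observations:
        return intervals
    start, state = observations[0]
    for timestamp, available in observations[1:]:
        if available != state:
            # The state flipped: close the current interval and open a new one.
            intervals.append((start, timestamp, state))
            start, state = timestamp, available
    # The last interval runs until the end of the observation period.
    intervals.append((start, trace_end, state))
    return intervals


obs = [(0.0, True), (10.0, True), (25.0, False), (40.0, True)]
intervals = to_intervals(obs, trace_end=60.0)
availability = [end - start for start, end, up in intervals if up]
unavailability = [end - start for start, end, up in intervals if not up]
```

The two duration lists are the kind of input that a later distribution-fitting step would consume.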

  29. FTA Schema
      Schema diagram (tables): platform, node, node_perf, component, component_type codes, creator, event_trace, event_state, event_type codes, event_end reason codes
      • Resource (versus job or user) centric
      • Event-based
      • Associated metadata
      • Codes for different components, events, and errors
      • Balance between completeness and sparseness
      • Extensibility
      • Raw, Tabbed, Relational database (MySQL)

  30. Data Quality Assessment
      • Syntactic: standard format library that checks data types and number of fields (automated)
      • Semantic: time moves forward and is non-overlapping, state is valid (automated)
      • Visual: look at the distribution for outliers (manual)
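The semantic checks listed above could look roughly like the following sketch. The `(start_time, end_time, state)` event layout, the state coding in `VALID_STATES`, and the function name are all assumptions made for illustration; they are not taken from the FTA format library.

```python
from typing import List, Tuple

Event = Tuple[float, float, int]  # (start_time, end_time, state), hypothetical layout
VALID_STATES = {0, 1}             # assumed coding: 0 = unavailable, 1 = available


def semantic_errors(events: List[Event]) -> List[str]:
    """Return human-readable problems found in one node's event list."""
    errors: List[str] = []
    previous_end = None
    for index, (start, end, state) in enumerate(events):
        if state not in VALID_STATES:
            errors.append(f"event {index}: invalid state {state}")
        if end <= start:
            errors.append(f"event {index}: time does not move forward")
        if previous_end is not None and start < previous_end:
            errors.append(f"event {index}: overlaps previous event")
        # Track the latest end time seen so far, even if this event is malformed.
        previous_end = end if previous_end is None else max(previous_end, end)
    return errors


clean = [(0.0, 25.0, 1), (25.0, 40.0, 0), (40.0, 60.0, 1)]
broken = [(0.0, 25.0, 1), (20.0, 40.0, 0), (40.0, 35.0, 7)]
clean_report = semantic_errors(clean)
broken_report = semantic_errors(broken)
```

Returning a list of messages rather than raising on the first problem lets a whole trace be audited in one pass.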

  31. Data Sets
      • Usage (P2P, supercomputers, grids, desktop PCs)
      • Type (CPU, network, IO)
      • Scale (50-240,000 hosts)
      • Volatility (minutes to days)
      • Resolution (wrt failure detection)

  32. Currently 21 Data Sets http://fta.inria.fr

  42. Statistical Analysis

  43. FTA Toolbox
      Pipeline diagram: MySQL trace database; stages initialize, query, process, finalize; outputs text, html, wiki, latex
      • Makes it easy to run a set of statistical measures across all the data sets
      • Provides a library of functions that can be reused and incorporated
      • Implemented in Matlab
      • svn checkout svn://scm.gforge.inria.fr/svn/fta/ toolbox

  44. Failure Modelling
      • Approach
      • Model availability and unavailability intervals, each with a single probability distribution
      • Assume availability and unavailability intervals are independent and identically distributed
      • Descriptive, not prescriptive

  45. Distributions of Availability and Unavailability Intervals

  46. Distributions of Availability and Unavailability Intervals Qualitative Description

  47. Model Fitting
      • For each candidate probability distribution
      • Compute parameters that maximize the distribution's likelihood
      • Measure goodness of fit using Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests
      • Compute each p-value using 30 samples; take the average of 1000 p-values
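The fitting procedure above can be sketched for one candidate distribution. This is a minimal illustration under stated assumptions, not the toolbox's Matlab code: it uses only the exponential distribution, only the KS test (AD is omitted), an asymptotic Kolmogorov p-value with a small-sample correction, and it reads "30 samples" as a random subsample of 30 intervals per p-value. All function names and the synthetic data are hypothetical.

```python
import math
import random
from typing import List


def fit_exponential(durations: List[float]) -> float:
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    return len(durations) / sum(durations)


def ks_statistic(sample: List[float], rate: float) -> float:
    """Largest gap between the empirical CDF and the exponential CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        cdf = 1.0 - math.exp(-rate * x)
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


def ks_p_value(statistic: float, n: int) -> float:
    """Asymptotic Kolmogorov p-value with a small-sample correction (assumed choice)."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * statistic
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the p-value is numerically 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def average_p_value(durations: List[float], rate: float,
                    sample_size: int = 30, rounds: int = 1000, seed: int = 0) -> float:
    """Average the KS p-value over many random subsamples of the data."""
    rng = random.Random(seed)
    p_values = [
        ks_p_value(ks_statistic(rng.sample(durations, sample_size), rate), sample_size)
        for _ in range(rounds)
    ]
    return sum(p_values) / rounds


# Synthetic interval lengths, generated only to exercise the sketch.
generator = random.Random(1)
durations = [generator.expovariate(0.5) for _ in range(5000)]
rate = fit_exponential(durations)
mean_p = average_p_value(durations, rate)
```

Subsampling keeps the test from rejecting every candidate on very large traces, where even tiny deviations from the model become statistically significant.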

  48. P-Values for KS & AD Goodness-of-Fit Tests (panels: Availability, Unavailability)
