The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems
Derrick Kondo (1), Bahman Javadi (1), Alexandru Iosup (2), Dick Epema (2)
(1) INRIA, France
(2) TU Delft, The Netherlands
Motivation
• Push toward experimental computer science
• Hard to evaluate and compare algorithms and models for fault tolerance
  • Lack of public trace data sets
  • Lack of a standard trace format
  • Lack of parsing and analytical tools
• Failures in distributed systems have increasingly high negative impact and complex dynamics
Failure Trace Archive (FTA)
http://fta.inria.fr
• Availability traces of distributed systems, differing in scale, volatility, and usage
• Standard event-based format for failure traces
• Scripts and tools for parsing and analyzing traces, in an svn repository
Related Work

| Resource               | Data Sets | Emphasis        | Format | Parsing Tools | Analysis Tools |
|------------------------|-----------|-----------------|--------|---------------|----------------|
| Grid Observatory       | –         | EGEE Grid       | ✗      | ✗             | ✗              |
| Computer Failure Repo. | 12        | mainly clusters | ✗      | ✗             | ✗              |
| Repo. of Avail. Traces | 5         | mainly P2P      | ✓      | ✓             | ✗              |
| Desktop Grid Archive   | 4         | Desktop Grids   | ✓      | ✗             | ✗              |
| FTA                    | 22 (1)    | diverse         | ✓      | ✓             | ✓              |

(1) FTA includes data sets of the former three resources, in addition to providing several new data sets.
Enabled Studies
• Comparing models/algorithms using identical data sets
• Evaluating the generality/specificity of a model/algorithm across different types of systems
• Evaluating the generality of a system trace
• Analyzing the evolution of failures over time
• And many more...
Contributions
• Description of the FTA, its trace format, and its analysis toolbox
• High-level statistical characterization of failures in each data set
• Demonstration of the importance of public data sets and methods via characterization of ambiguous data sets
Background Definitions
• Failure: observed deviation from the correct system state
• Availability (unavailability) interval: continuous period during which the system is in a correct (incorrect) state
• Error: system state (not externally visible) that leads to a failure
• Fault: root cause of an error
FTA Schema
• Resource (versus job or user) centric
• Event-based
• Associated metadata
• Codes for different components, events, and errors
• Balance between completeness and sparseness
• Extensibility
• Available as raw, tabbed, and relational database (MySQL) formats
• Tables and code sets: platform, node, node_perf, component_type, component, creator codes, event_trace, event_state, event_type codes, event_end_reason codes (minimal relational sketch below)
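To make the event-centric layout concrete, here is a minimal sketch of the core event table, built in SQLite via Python. The field names (platform, node, event_trace, event_type, event_state, event_end_reason) come from the slide above; the column types, exact layout, and example rows are assumptions, not the official FTA MySQL schema.

```python
import sqlite3

# A minimal, illustrative version of an FTA-style event table.
# Field names follow the slide; types and layout are assumptions,
# not the official FTA MySQL schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE platform (platform_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE node     (node_id INTEGER PRIMARY KEY, platform_id INTEGER,
                       FOREIGN KEY (platform_id) REFERENCES platform(platform_id));
CREATE TABLE event_trace (
    event_id         INTEGER PRIMARY KEY,
    node_id          INTEGER,            -- which resource the event refers to
    component_code   TEXT,               -- e.g. 'cpu', 'network' (coded)
    event_type_code  TEXT,               -- e.g. 'availability', 'unavailability'
    event_state_code TEXT,               -- e.g. 'available', 'unavailable'
    event_start      REAL,               -- seconds since trace start
    event_end        REAL,
    event_end_reason TEXT,               -- coded reason the event ended
    FOREIGN KEY (node_id) REFERENCES node(node_id)
);
""")

# Example: one availability interval followed by an unavailability interval.
conn.execute("INSERT INTO platform VALUES (1, 'example_grid')")
conn.execute("INSERT INTO node VALUES (1, 1)")
conn.executemany(
    "INSERT INTO event_trace VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, "node", "availability",   "available",   0.0,    3600.0, "crash"),
     (2, 1, "node", "unavailability", "unavailable", 3600.0, 4200.0, "reboot")],
)
for row in conn.execute("SELECT event_id, event_state_code, event_start, event_end FROM event_trace"):
    print(row)
```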
Data Quality Assessment
• Syntactic: standard format library that checks data types and the number of fields (automated)
• Semantic: time moves forward and intervals are non-overlapping, state is valid (automated; sketch below)
• Visual: look at the distributions for outliers (manual)
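A minimal sketch of the automated semantic check ("time moves forward and is non-overlapping"), assuming each node's events are given as (start, end) pairs sorted by start time; this is illustrative, not the FTA toolbox's actual checker.

```python
def check_semantics(events):
    """Semantic sanity check for one node's event list.

    `events` is a list of (start, end) pairs in seconds, sorted by start time.
    Returns a list of human-readable problems (empty list means clean).
    """
    problems = []
    prev_end = None
    for i, (start, end) in enumerate(events):
        if end < start:
            problems.append(f"event {i}: time moves backward ({start} -> {end})")
        if prev_end is not None and start < prev_end:
            problems.append(f"event {i}: overlaps previous event (starts at {start} < {prev_end})")
        prev_end = end if prev_end is None else max(end, prev_end)
    return problems

# Example: the second event starts before the first one ends.
print(check_semantics([(0.0, 3600.0), (3500.0, 4200.0)]))
```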
Data Sets
• Usage (P2P, supercomputers, grids, desktop PCs)
• Type (CPU, network, I/O)
• Scale (50-240,000 hosts)
• Volatility (minutes to days)
• Resolution (with respect to failure detection)
Currently 21 Data Sets
http://fta.inria.fr
Statistical Analysis
FTA Toolbox
• Pipeline: MySQL trace database → initialize → query → process → finalize → output as text, html, wiki, or latex (sketch below)
• Makes it easy to run a set of statistical measures across all the data sets
• Provides a library of functions that can be reused and incorporated
• Implemented in Matlab
• svn checkout svn://scm.gforge.inria.fr/svn/fta/toolbox
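An illustrative Python analogue of the initialize → query → process → finalize flow. The real toolbox is written in Matlab and reads the MySQL trace database, so the function names and the tiny in-memory "database" below are hypothetical.

```python
# Illustrative analogue of the toolbox pipeline; not the actual Matlab code.
import statistics


def initialize(config):
    # Open the trace source. Here we fake an in-memory "database" of
    # availability interval lengths (in hours) per data set.
    return {"setA": [5.0, 12.0, 1.5, 8.0], "setB": [0.5, 0.7, 2.0]}


def query(db, dataset):
    # Select the availability intervals of one data set.
    return db[dataset]


def process(intervals):
    # Compute a statistical measure over the selected intervals.
    return {"mean": statistics.mean(intervals), "median": statistics.median(intervals)}


def finalize(results, fmt="text"):
    # Emit the results in one of the supported output formats.
    if fmt == "text":
        return "\n".join(f"{name}: mean={r['mean']:.2f}h median={r['median']:.2f}h"
                         for name, r in results.items())
    raise NotImplementedError(fmt)  # html / wiki / latex left out of the sketch


db = initialize(config={})
results = {name: process(query(db, name)) for name in db}
print(finalize(results))
```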
Failure Modelling
• Approach
  • Model availability and unavailability intervals, each with a single probability distribution (sampling sketch below)
  • Assume availability and unavailability intervals are independently and identically distributed (i.i.d.)
  • Descriptive, not prescriptive
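A minimal sketch of this modelling assumption: a node alternates between availability and unavailability, with each interval length drawn i.i.d. from one distribution per state. The Weibull/lognormal choices and parameters are placeholders, not the paper's fitted values.

```python
# Sketch only: distributions and parameters are placeholders.
import numpy as np

rng = np.random.default_rng(42)

def synthetic_trace(duration_h, avail_sampler, unavail_sampler):
    """Return a list of (state, start, end) tuples covering `duration_h` hours."""
    events, t, state = [], 0.0, "available"
    while t < duration_h:
        length = avail_sampler() if state == "available" else unavail_sampler()
        events.append((state, t, min(t + length, duration_h)))
        t += length
        state = "unavailable" if state == "available" else "available"
    return events

trace = synthetic_trace(
    duration_h=168.0,                                            # one week
    avail_sampler=lambda: rng.weibull(0.6) * 10.0,               # hours available
    unavail_sampler=lambda: rng.lognormal(mean=0.0, sigma=1.0),  # hours down
)
availability = sum(e - s for st, s, e in trace if st == "available") / 168.0
print(f"{len(trace)} intervals, fraction of time available: {availability:.2f}")
```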
Distributions of Availability and Unavailability Intervals: Qualitative Description
Model Fitting
• For each candidate probability distribution:
  • Compute the parameters that maximize the distribution's likelihood
  • Measure goodness of fit using the Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests
  • Compute each p-value using 30 samples, and take the average of 1000 p-values (sketch below)
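A sketch of this fitting procedure in Python using scipy, under the reading that each p-value comes from a size-30 subsample and 1000 such p-values are averaged. Only the KS test is shown (scipy provides no direct AD p-value against an arbitrary fitted distribution), and the Weibull candidate with synthetic data is a placeholder rather than the paper's results.

```python
# Sketch of: MLE fit, then average KS p-value over 1000 subsamples of size 30.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
intervals = rng.weibull(0.7, size=5000) * 8.0   # stand-in for real trace data (hours)

# Maximum-likelihood fit of one candidate distribution (location fixed at 0).
shape, loc, scale = stats.weibull_min.fit(intervals, floc=0)

# Average the KS p-value over 1000 subsamples of size 30.
pvals = []
for _ in range(1000):
    sample = rng.choice(intervals, size=30, replace=False)
    pvals.append(stats.kstest(sample, "weibull_min", args=(shape, loc, scale)).pvalue)

print(f"Weibull fit: shape={shape:.2f}, scale={scale:.2f}, mean KS p-value={np.mean(pvals):.3f}")
```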
P-Values for KS and AD Goodness-of-Fit Tests (panels: Availability, Unavailability)