Gautam Altekar and Ion Stoica
University of California, Berkeley

Debugging datacenter software is really hard
Datacenter software?
• Large-scale, data-intensive, distributed apps
Hard? Non-determinism
• Can’t reproduce failures
• Can’t cyclically debug
How can we reproduce non-deterministic failures in datacenter software?

Generate replica of original run, hence failures
[Diagram: Non-deterministic data (e.g., inputs, thread interleaving) -> Record -> Log file -> Replay]
Why deterministic replay?
• Model checking, testing, verification
• Goal: find errors pre-production
• Can’t catch all errors
• Can’t reproduce production failures
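The record/replay idea on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system described in the talk: the toy program `run`, the helpers `record` and `replay`, and the use of random integers as the only source of non-determinism are all hypothetical.

```python
import random

def run(next_input, log=None):
    """Toy program whose result depends on non-deterministic inputs."""
    total = 0
    for _ in range(5):
        value = next_input()
        if log is not None:
            log.append(value)  # record phase: save each non-deterministic value
        total = total * 31 + value
    return total

def record():
    """Run once with real non-determinism and capture it in a log."""
    log = []
    result = run(lambda: random.randint(0, 99), log)
    return result, log

def replay(log):
    """Re-run the program, feeding it the logged values instead of fresh ones."""
    logged = iter(log)
    return run(lambda: next(logged))

original_result, log = record()
replayed_result = replay(log)
```

Because the replayed run consumes exactly the values captured in the log, it follows the same path as the original run, so a failure seen during recording would show up again during replay.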

• Always-on production use
  • < 5% slowdown
  • Log no more than traditional console logs (100 Kbps)
• High fidelity replay
  • Reproduce the most difficult of non-deterministic bugs

None suitable for the datacenter
System                 Always-on operation?   High fidelity replay?
FDR, Capo, CoreDet     No                     Yes
VMWare, PRES, ReSpec   Yes                    No
ODR, ESD, SherLog      Yes                    No
R2                     Yes                    No

Build a Data Center Replay System
Target
• Large-scale, data-intensive, distributed apps
• Linux/x86
Design for
• Record efficiently
  • ~20% overhead, 100 KBps
• High replay fidelity
  • Replays difficult bugs

• Overview
• Approach
• Testing the Hypothesis
• Preliminary Results
• Ongoing Work

For debugging, it is not necessary to produce an identical run
It often suffices to produce any run that has the same control-plane behavior

Datacenter apps have two components
1. Control-plane code
   Manages the data
   Complicated, low traffic
   • Distributed data placement
   • Replica consistency
2. Data-plane code
   Processes the data
   Simple, high traffic
   • Checksum verification
   • String matching

Relax guarantees to control-plane determinism
Meet all requirements for a practical datacenter replay system
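One way to picture the relaxed guarantee is as an equivalence check that ignores data-plane values. The sketch below is only an illustration: representing a run as a list of `(plane, value)` events, and the function names, are assumptions made here and do not come from the talk.

```python
def control_plane_view(events):
    """Keep only control-plane events, preserving their order."""
    return [value for plane, value in events if plane == "control"]

def same_control_plane_behavior(run_a, run_b):
    """Two runs are equivalent if their control-plane event sequences match."""
    return control_plane_view(run_a) == control_plane_view(run_b)

# Hypothetical event traces: (plane, value) pairs.
original = [("control", "open /f"), ("data", b"\x01\x02"),
            ("control", "replicate to node-2")]
replayed = [("control", "open /f"), ("data", b"\xff\xee"),
            ("control", "replicate to node-2")]
diverged = [("control", "open /f"), ("data", b"\x01\x02"),
            ("control", "replicate to node-3")]

acceptable = same_control_plane_behavior(original, replayed)
unacceptable = same_control_plane_behavior(original, diverged)
```

Under this relaxed notion, `replayed` counts as a valid replay even though its data-plane bytes differ, while `diverged` does not because a control-plane decision changed.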

• Overview
• Approach
• Testing the Hypothesis
• Preliminary Results
• Ongoing Work

Experimentally show the control plane has:
1. Higher bug rates, by far
   • Most bugs must stem from control plane code
   • Implies high fidelity replay
2. Lower data rates, by far
   • Consumes and generates very little I/O
   • Implies low overhead recording

[Figure: data rate and bug rate for the data plane vs. the control plane, each split 99% / 1%]
Evidence supports the hypothesis

• Overview
• Hypothesis
• Testing the Hypothesis
  • How?
• Preliminary Results
• Ongoing Work

• To make statements about planes, we must first identify them
• Goal: Classify code as control and data plane code
  • Hard: tied to program semantics
• Obvious approach: Manually identify plane code
  • Error prone and unreliable

1. Manually identify user-data files
   • User data? E.g., file uploaded to HDFS
2. Automatically identify static instructions tainted by user data
   • Taint-flow analysis
3. Instructions tainted by user data are in data plane; others are in control plane
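The three steps above can be sketched on a toy instruction trace. This is a simplified illustration, not the analysis used in the talk: the `Instr` record, the `classify` function, and the location names (`user_buf`, `conf`, `r1`..`r4`) are hypothetical, and real taint tracking works on machine instructions and memory rather than named locations.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    addr: int     # static instruction address
    dest: str     # location written by the instruction
    srcs: tuple   # locations read by the instruction

def classify(trace, user_data_locations):
    """Split executed static instructions into (data_plane, control_plane)."""
    tainted = set(user_data_locations)   # step 1: manually chosen taint sources
    data_plane, executed = set(), set()
    for ins in trace:                    # step 2: propagate taint dynamically
        executed.add(ins.addr)
        if any(src in tainted for src in ins.srcs):
            tainted.add(ins.dest)
            data_plane.add(ins.addr)
        else:
            tainted.discard(ins.dest)    # destination overwritten with clean data
    # step 3: instructions never seen touching tainted data are control plane
    return data_plane, executed - data_plane

trace = [
    Instr(0x10, "r1", ("user_buf",)),    # load a byte of user data
    Instr(0x14, "r2", ("r1", "r2")),     # accumulate a checksum
    Instr(0x18, "r3", ("conf",)),        # read a configuration value
    Instr(0x1C, "r4", ("r3",)),          # decide on data placement
]
data_plane, control_plane = classify(trace, {"user_buf"})
```

In this sketch a static instruction lands in the data plane if any of its dynamic instances reads tainted data, which is one possible reading of "static instructions tainted by user data".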

• Instruction-level
  • Works with apps written in arbitrary languages
• Dynamic
  • Easier to get accurate results (e.g., in the presence of dynamically generated code)
• Distributed
  • Avoids need to identify user-data entry points for each component

• It’s imprecise
  • We may have misidentified user data (unlikely)
  • We don’t propagate taint across tainted-pointer dereferences (to avoid false positives)
• It’s incomplete
  • Dynamic analysis often has low code coverage
  • Results do not generalize to arbitrary executions
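The pointer-dereference choice mentioned above can be made concrete with a small sketch. The function `load_taint` and its `follow_pointers` flag are hypothetical names used only for illustration; they model the taint of the result of a load `dest = *ptr` under the two possible policies.

```python
def load_taint(ptr_tainted: bool, value_tainted: bool,
               follow_pointers: bool = False) -> bool:
    """Return whether the result of `dest = *ptr` is considered tainted."""
    if follow_pointers:
        # Alternative policy: a tainted address taints the loaded value too.
        return value_tainted or ptr_tainted
    # Policy described on the slide: only the loaded value's own taint counts.
    return value_tainted

# Table lookup indexed by a user byte: tainted pointer, clean table entry.
without_pointer_taint = load_taint(ptr_tainted=True, value_tainted=False)
with_pointer_taint = load_taint(ptr_tainted=True, value_tainted=False,
                                follow_pointers=True)
```

With the slide's policy the lookup result stays untainted, which can miss some data-plane code; with the alternative it becomes tainted, which tends to spread taint widely and is the source of false positives the slide refers to.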

• Overview
• Hypothesis
• Testing the Hypothesis
• Evaluation
• Ongoing Work

• Distributed applications
  • Hypertable: Key-value store
  • KFS/CloudStore: Filesystem
  • OpenSSH (scp): Secure file transfer
• Configuration
  • 1 client, 1 of each system node
  • 10 GB user-data file
  • Kept simple to ease understanding

• Bug rates
  • Indirect: code size (static x86 instructions executed)
  • Direct: Bug-report count (Bugzilla)
• Data rates
  • Fraction of total I/O
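Each of these metrics reduces to a per-plane share of some count (static instructions, bug reports, or I/O bytes). A minimal sketch of that calculation follows; the function name `plane_shares` and the input counts are hypothetical and are not measurements from the talk.

```python
def plane_shares(control: float, data: float) -> tuple:
    """Return (control %, data %) of a count split across the two planes."""
    total = control + data
    if total == 0:
        raise ValueError("no activity recorded for either plane")
    return 100.0 * control / total, 100.0 * data / total

# Hypothetical counts, chosen only to illustrate the calculation.
static_instr_share = plane_shares(control=990, data=10)   # code-size metric
io_share = plane_shares(control=1, data=99)               # data-rate metric
```

The same helper applies to all three metrics because each one is reported as a control/data percentage split.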

• Overview
• Hypothesis
• Testing the Hypothesis
• Evaluation
  • OpenSSH
• Ongoing Work

OpenSSH: Executed Static Instructions
               Control (%)   Data (%)   Total (K)
Agent          100           0          11
Server         97.8          2.2        103
Client (scp)   98.9          1.1        69
Average        98.9          1.1        61
Even components that touch user-data are almost exclusively control plane

OpenSSH: Bugzilla Report Count
               Control (%)   Data (%)   Total
Agent          100           0          2
Server         100           0          215
Client (scp)   99            1          153
Average        99.7          0.3        123
Control plane is the most error-prone, even in components that touch user-data

(1) Control plane executes many functions to perform its core tasks
OpenSSH: # of functions hosting top 90% of dynamic instructions
               Control   Data
Agent          13        0
Server         100       1
Client (scp)   27        1
Average        47        1
Most active data plane functions: aes_encrypt() and aes_decrypt()

(2) Control plane relies heavily on custom code
OpenSSH: % of Dynamic Instructions Issued from Libraries
               Control (%)   Data (%)
Agent          82.7          0
Server         93.6          99.6
Client (scp)   96.2          100
Average        90.8          99.8
Data plane often relies on well-tested libraries (e.g., libc, libcrypto)

OpenSSH: Fraction of Total I/O
               Control (%)   Data (%)   Total (GB)
Agent          100           0          0.001
Server         0.8           99.2       20.2
Client (scp)   0.6           99.4       20.2
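To see what these percentages mean in absolute terms, the control-plane share of each component's I/O can be converted to gigabytes. The sketch below uses the control percentages and totals from the table above; the helper name `control_plane_gb` is hypothetical, and treating control-plane I/O as the volume a recorder would need to log is an assumption made here for illustration.

```python
# (control %, total GB) per component, copied from the OpenSSH I/O table.
io_table = {
    "Agent": (100.0, 0.001),
    "Server": (0.8, 20.2),
    "Client (scp)": (0.6, 20.2),
}

def control_plane_gb(control_pct: float, total_gb: float) -> float:
    """Absolute control-plane I/O volume in GB."""
    return total_gb * control_pct / 100.0

control_volumes = {name: control_plane_gb(pct, total)
                   for name, (pct, total) in io_table.items()}
```

For the server this works out to roughly 0.16 GB of the 20.2 GB total, and for the client roughly 0.12 GB, which is the sense in which the control plane consumes and generates very little I/O.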

• How well do results generalize?
  • To other code paths
  • To other applications
• How do we achieve control plane determinism?
  • Should we just ignore the data plane?
  • Should we use inference techniques?

What have we argued?
  Control-plane determinism enables record-efficient, high-fidelity datacenter replay
What’s next?
  More application data points
Questions?
