Introducing Data Downtime: From Firefighting to Winning
Barr Moses
When your CEO or customer says the data is wrong...
Unreliable data is more prevalent than we admit
“Our data is 100% reliable.” Said no one, ever.
Current solutions take heroic (or tedious) efforts
“We’ve been building out data monitoring solutions for the last year and FINALLY we’re in a good place.” (Engineering Manager, 300-person tech co)
“Each data engineer adds monitoring and basic checks to their pipelines, but it’s all manual.” (Engineering Manager, 1,500-person tech co)
“Data downtime” refers to periods when your data is partial, erroneous, missing, or otherwise inaccurate.
Signs you’re experiencing data downtime
● Your data team spends >10% of its time on fire drills
● Your company lost $$ because data was broken
● Critical analysis fails because missing data went unnoticed
● Troubleshooting involves tedious step-by-step debugging
What can we do about it?
If you care about the data industry, make data downtime mitigation part of your culture:
- Team SLAs
- Weekly meetings
- Post-mortems
- Pay down data debt
World-class companies mitigate data downtime
Reactive: Firefighting & crisis mode
Proactive: Manual tracking & detection
Automated: Programmatic enforcement
Scalable: Substantial, embedded coverage
Proactive: Manual tracking & detection
Examples:
● Generate the data more frequently than you need it
● Manually write data QA queries and alert on them
● Track timestamps to ensure data isn’t stale (see the freshness sketch below)
● Validate row counts in critical stages of the pipeline
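A minimal sketch of the freshness check above: compare the newest timestamp in a key table against the current time. The table, column, and 6-hour SLA are hypothetical, and `conn` is any DB-API connection (e.g. psycopg2); the column is assumed to be stored as a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=6)  # illustrative freshness SLA, not prescriptive

def is_stale(conn, table, ts_column="updated_at"):
    with conn.cursor() as cur:
        # table/column names come from trusted config, not user input
        cur.execute(f"SELECT MAX({ts_column}) FROM {table}")
        newest = cur.fetchone()[0]
    if newest is None:
        return True  # empty table: treat as stale
    return datetime.now(timezone.utc) - newest > MAX_AGE
```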
Example #1: Validate row counts in critical stages of pipeline
Data Source → Job #1 → Job #2 → ... → Job #N → Data Warehouse
Validate number of rows for key tables (e.g. website visitors) is within reasonable range
Consider failing the job vs. generating a warning/alert
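A minimal sketch of this check, assuming a Postgres warehouse reachable via psycopg2; the table name and bounds are illustrative. The `fail_job` flag captures the "fail vs. warn" decision on the slide.

```python
import psycopg2

ROW_COUNT_BOUNDS = {"website_visitors": (50_000, 500_000)}  # hypothetical range

def validate_row_count(conn, table, fail_job=True):
    with conn.cursor() as cur:
        # table names come from the trusted config above, not user input
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        count = cur.fetchone()[0]
    low, high = ROW_COUNT_BOUNDS[table]
    if not low <= count <= high:
        msg = f"{table}: row count {count} outside expected range [{low}, {high}]"
        if fail_job:
            raise RuntimeError(msg)  # fail the job outright...
        print(f"WARNING: {msg}")     # ...or just warn/alert and continue
    return count

# conn = psycopg2.connect("dbname=warehouse")
# validate_row_count(conn, "website_visitors", fail_job=False)
```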
Example #1: Validate row counts in critical stages of pipeline
Data Source → Job #1 → Job #2 → ... → Job #N → Data Warehouse
Each job emits row-count metrics → Time-series DB → Reporting, alerting
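One hedged way to wire this up: rather than failing inline, each job pushes the row count to a time-series backend and lets the alerting layer decide. This sketch assumes a StatsD-compatible listener on localhost:8125; the metric name is illustrative.

```python
import socket

def emit_gauge(metric, value, host="localhost", port=8125):
    # StatsD plaintext protocol: "<name>:<value>|g" sent over UDP
    payload = f"{metric}:{value}|g".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

emit_gauge("pipeline.website_visitors.row_count", 123_456)
```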
Example #1: Validate row counts in critical stages of pipeline
● Particularly effective when
○ Row count is somewhat predictable
○ Problems tend to result in substantial missing or duplicated data
● Limitations
○ Could require tuning and maintenance
○ Sensitive to seasonality
○ Doesn’t catch all issues
Automated: Programmatic enforcement
Examples:
● Automatically track metrics about dimensions & measures, compare to past periods (see the sketch below)
● Create a data health dashboard
● Document fields and tables
● Monitor/enforce schema and validity of upstream data
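A simple sketch of "compare to past periods": flag today's value if it deviates from a trailing window by more than k standard deviations. The window, threshold, and sample numbers are illustrative, not prescriptive.

```python
from statistics import mean, stdev

def is_anomalous(history, today, k=3.0):
    """history: the same metric over past periods (e.g. daily row counts)."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > k * max(sigma, 1e-9)  # guard zero variance

daily_counts = [98_200, 101_500, 99_800, 100_900, 97_400, 100_100, 99_300]
print(is_anomalous(daily_counts, today=54_000))  # True: likely missing data
```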
Example #2: Monitor/enforce schema and validity of upstream data
● Why
○ Data sources change (e.g. engineering pushes a schema change)
○ The data team does not control them and has no visibility
● How
○ Define the expected schema
○ Validate data against the schema before processing (see the sketch below)
○ Discard or alert on data that doesn’t match
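A minimal sketch of the validate-before-processing step, using the jsonschema library; the schema and field names are hypothetical.

```python
from jsonschema import validate, ValidationError

EXPECTED_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "email": {"type": "string"},
        "signup_ts": {"type": "string"},
    },
    "required": ["user_id", "email", "signup_ts"],
}

def check_record(record):
    try:
        validate(instance=record, schema=EXPECTED_SCHEMA)
        return True
    except ValidationError as err:
        # discard or alert on data that doesn't match the contract
        print(f"Schema violation: {err.message}")
        return False

check_record({"user_id": 42, "email": "a@b.com", "signup_ts": "2019-07-01"})
check_record({"user_id": "42"})  # wrong type + missing fields -> alert
```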
Example #2: Monitor/enforce schema and validity of upstream data
● Bonus: validate the data itself, not just the schema (e.g. the Great Expectations open source library)
Source: https://github.com/great-expectations/great_expectations
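A hedged sketch of the Great Expectations idea: declare expectations about the data itself. The pandas-style API shown is the classic one and has changed across versions, so treat the exact calls as illustrative.

```python
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", None],
}))

df.expect_column_values_to_not_be_null("user_id")  # passes
result = df.expect_column_values_to_not_be_null("country")
print(result)  # reports success=False: a null slipped into `country`
```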
Example #2: Monitor/enforce schema and validity of upstream data
● Bonus: automate ownership & page the right owner (e.g. PagerDuty alerts)
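A sketch of paging the right owner via the PagerDuty Events API v2; the routing keys and the ownership map are placeholders.

```python
import requests

def page_owner(routing_key, summary, source="data-pipeline"):
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": "error"},
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
    resp.raise_for_status()

# route the alert to whoever owns the failing table (mapping is hypothetical)
TABLE_OWNERS = {"website_visitors": "ROUTING_KEY_FOR_GROWTH_TEAM"}
page_owner(TABLE_OWNERS["website_visitors"],
           "website_visitors row count outside expected range")
```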
Scalable: Substantial, embedded coverage
Examples:
● Validate measures across different tables and sources
● Track and annotate data issues and questions
● Create reusable components to calculate metrics
● Set up a staging environment that closely resembles production
● Embed automated data validation code across your pipeline (see the sketch below)
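One illustrative way to embed validation so it travels with the pipeline code rather than living in one-off scripts: a decorator that runs declared checks on every step's output. The step and check names are hypothetical.

```python
import functools

def validated(*checks):
    def wrap(step):
        @functools.wraps(step)
        def run(*args, **kwargs):
            out = step(*args, **kwargs)
            for check in checks:
                assert check(out), f"{step.__name__} failed {check.__name__}"
            return out
        return run
    return wrap

def non_empty(rows):
    return len(rows) > 0

@validated(non_empty)
def transform(rows):
    # placeholder pipeline step
    return [r for r in rows if r.get("user_id") is not None]

print(transform([{"user_id": 1}, {"user_id": None}]))  # passes: one valid row
```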
Example #3: Embed automated data validation code across your pipeline
Case in point: Netflix’s RAD
● Need to validate tens of thousands of metrics, with seasonality
● RPCA (Robust PCA) algorithm to detect time-series anomalies
● A Pig wrapper allows pipeline engineers to validate key metrics
● Successes include (a) transaction data per banking institution and (b) sign-up conversion by country and browser
Source: https://medium.com/netflix-techblog/rad-outlier-detection-on-big-data-d6b0494371cc
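This is not Netflix's RPCA algorithm, but as a much simpler stand-in in the same spirit (resist outliers when estimating what "normal" looks like), here is a median/MAD-based detector; the sample series is made up.

```python
import numpy as np

def mad_outliers(series, threshold=3.5):
    x = np.asarray(series, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1e-9  # guard an all-equal series
    # 0.6745 scales MAD to be comparable to a standard deviation
    scores = 0.6745 * (x - med) / mad
    return np.abs(scores) > threshold

signups = [120, 118, 123, 119, 121, 40, 122]  # one suspicious dip
print(mad_outliers(signups))  # flags the 40
```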
World-class companies mitigate data downtime
Reactive: Firefighting & crisis mode
Proactive: Manual tracking & detection
Automated: Programmatic enforcement
Scalable: Substantial, embedded coverage
Your company will kick ass if you figure this out
● Minimize fire drills
● Gain confidence
● Make better decisions
● Move faster
Join the data downtime movement. Connect with me @barrmoses on Medium or LinkedIn