99.99% Uptime at 175 TB of Data Per Day
Ben John, CTO (bjohn@appnexus.com)
Matt Moresco, Software Engineer, Real Time Platform (mmoresco@appnexus.com)
[Architecture diagram: Web page, Cookiemonster, Impbus, Bidder, Batches, External bidders (~120ms budget), Packrat, Data pipeline]
Managing failure
Prevent it in the first place: unit/integration tests, canary releases
When it happens, recover quickly
Ways we fail
Data distribution unreliability
C woes
DDoSing ourselves
Handling bad data
Good news: our systems deliver object updates to thousands of servers around the world in under two minutes!
Bad news: our systems can deliver crashy data to thousands of servers around the world in under two minutes!
Handling bad data
Validation engines: run a copy of the production app and see if it crashes before distributing data globally
This can still fail in bad ways: VE version not aligned with production, time-based crashes
Handling bad data
Feature switches: AN_HOOK
Roll back time! Prevent distribution past a timestamp
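The slides only name the mechanism (AN_HOOK) and the idea; a minimal sketch of what a "roll back time" cutoff check might look like follows. The identifiers (distribution_cutoff_epoch, object_update, should_distribute) are illustrative assumptions, not the real AN_HOOK API.

```c
/* Hypothetical sketch of a "roll back time" switch: refuse to distribute
 * any object whose update is newer than an operator-set cutoff.
 * Names here are illustrative, not AppNexus's actual AN_HOOK interface. */
#include <stdbool.h>
#include <stdint.h>

/* 0 means "no cutoff"; set via ops tooling when bad data is detected. */
static volatile int64_t distribution_cutoff_epoch = 0;

struct object_update {
    int64_t updated_at;   /* epoch seconds when the object changed */
    /* ... payload ... */
};

/* Returns true if the update is safe to push to the edge. */
static bool should_distribute(const struct object_update *u)
{
    int64_t cutoff = distribution_cutoff_epoch;
    if (cutoff != 0 && u->updated_at > cutoff)
        return false;   /* "roll back time": hold anything newer */
    return true;
}
```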
C woes
No exceptions in C!
core_me_maybe: catch signal, throw out request, return to event loop
Flipped off on some instances so we can get a backtrace
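The slide only names the pattern (core_me_maybe); a minimal sketch of catch-signal, drop-request, return-to-event-loop using sigsetjmp/siglongjmp might look like the following. Everything besides the standard signal APIs is a hypothetical stand-in, and longjmp-ing out of a SIGSEGV handler is a pragmatic hack rather than guaranteed-portable behavior.

```c
/* Sketch of "catch signal, throw out request, return to event loop".
 * core_me_maybe is only a name from the slides; the rest is illustrative. */
#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>

static sigjmp_buf request_checkpoint;
static volatile sig_atomic_t in_request = 0;

/* Flip to false on a few instances so the crash produces a core/backtrace. */
static bool core_me_maybe_enabled = true;

static void crash_handler(int sig)
{
    if (core_me_maybe_enabled && in_request) {
        in_request = 0;
        siglongjmp(request_checkpoint, sig);  /* abandon this request */
    }
    signal(sig, SIG_DFL);   /* re-raise so the OS writes a core file */
    raise(sig);
}

static void handle_request(void)
{
    /* request processing that might dereference bad data */
}

int main(void)
{
    signal(SIGSEGV, crash_handler);
    signal(SIGBUS, crash_handler);

    for (;;) {                       /* event loop */
        if (sigsetjmp(request_checkpoint, 1) != 0) {
            fprintf(stderr, "dropped a crashing request\n");
            continue;                /* back to the loop, keep serving */
        }
        in_request = 1;
        handle_request();
        in_request = 0;
    }
}
```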
Packrat
Home-grown data router: transform, buffer, compress, forward
Transformations: message format, sharding, sampling, filtering
Message formats: protobuf, native x86 format, JSON (rolling your own serialization format is probably a bad idea)
High-volume disk throughput
Guaranteed message delivery
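As a rough illustration of two of the transformations listed (sharding and sampling), a key-hash approach like the sketch below is one common way to do it; this is an assumption about the technique, not Packrat's actual code.

```c
/* Illustrative sketch of hash-based sharding and deterministic 1-in-N
 * sampling. Not Packrat's real implementation. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a: a simple, stable hash so the same key always maps to the
 * same downstream shard. */
static uint64_t fnv1a(const char *key, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static unsigned shard_for(const char *key, unsigned num_shards)
{
    return (unsigned)(fnv1a(key, strlen(key)) % num_shards);
}

/* Keep roughly 1 out of every `rate` messages, keyed on the same hash so
 * sampling decisions are consistent per key. */
static bool sample(const char *key, unsigned rate)
{
    return rate <= 1 || fnv1a(key, strlen(key)) % rate == 0;
}
```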
Packrat topology
[World map of datacenters: Amsterdam, LA, NY, Frankfurt, Singapore]
Packrat protocol
Group by like type, batch, HTTP POST
Prefer to send full buffers; fall back to a 10s limit
Snappy-compress everything
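A sketch of that batching policy: fill a per-type buffer, flush when it is full or when 10 seconds have passed, and Snappy-compress the batch before the POST. Buffer sizes, names, and send_http_post() are assumptions; snappy-c.h is Snappy's official C binding.

```c
#include <snappy-c.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BATCH_BYTES   (4 * 1024 * 1024)   /* hypothetical full-buffer size */
#define FLUSH_SECONDS 10                  /* fall-back flush interval */

struct batch {
    char   buf[BATCH_BYTES];
    size_t used;
    time_t opened_at;
};

/* Provided elsewhere: POSTs one compressed batch to the downstream peer. */
int send_http_post(const char *body, size_t len);

static int flush_batch(struct batch *b)
{
    if (b->used == 0)
        return 0;

    size_t out_len = snappy_max_compressed_length(b->used);
    char *out = malloc(out_len);
    if (!out)
        return -1;

    int rc = -1;
    if (snappy_compress(b->buf, b->used, out, &out_len) == SNAPPY_OK)
        rc = send_http_post(out, out_len);
    /* rc != 0: the real system would spill the failed batch to disk
     * for repackd to retry (next slides). */

    free(out);
    b->used = 0;
    b->opened_at = time(NULL);
    return rc;
}

/* Called for every incoming message of this batch's type. */
static int append_message(struct batch *b, const char *msg, size_t len)
{
    if (len > sizeof(b->buf))
        return -1;                               /* oversized message: drop */
    if (b->used + len > sizeof(b->buf))          /* prefer full buffers */
        flush_batch(b);
    memcpy(b->buf + b->used, msg, len);
    b->used += len;

    if (time(NULL) - b->opened_at >= FLUSH_SECONDS)   /* 10s fallback */
        return flush_batch(b);
    return 0;
}
```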
Packrat failure handling
Request fails: write it to disk
repackd: a separate process running on the instance that continually reads failed rows from disk and retries sending them; if the retry fails, write to disk and do it all again
Prone to nasty failure scenarios
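The slides describe repackd only at this level of detail; a hedged sketch of the scan-retry-leave-on-disk loop is below, with the spool path and resend_spooled_batch() assumed for illustration.

```c
/* Illustrative loop in the spirit of repackd: scan a spool directory for
 * failed batches, retry each, and leave the file in place when the retry
 * fails so the next pass picks it up again. Paths and helpers are
 * hypothetical. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

#define SPOOL_DIR "/var/spool/packrat"   /* assumed location */

/* Provided elsewhere: re-send one spooled batch; returns 0 on success. */
int resend_spooled_batch(const char *path);

static void retry_pass(void)
{
    DIR *d = opendir(SPOOL_DIR);
    if (!d)
        return;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;

        char path[PATH_MAX];
        snprintf(path, sizeof(path), "%s/%s", SPOOL_DIR, e->d_name);

        if (resend_spooled_batch(path) == 0)
            unlink(path);            /* delivered: drop the spooled copy */
        /* else: leave it on disk and try again next pass */
    }
    closedir(d);
}

int main(void)
{
    for (;;) {          /* "do it all again" */
        retry_pass();
        sleep(5);       /* hypothetical pause between passes */
    }
}
```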
Bad data
If a schema evolution diverges in prod, we will crash
Because of our failure-handling mechanisms, a single bad message can machine-gun an entire datacenter
Packrat failure handling
Because we buffer data in outgoing requests, we send back a 200 OK before a message is sent downstream or written to disk
What about data in memory when Packrat crashes? 🤕
Packrat failure handling
Write-ahead log: write every (compressed) incoming request to disk for a 5-minute window
On startup, replay all traffic (because we don't care about duplicates)
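A hedged sketch of that write-ahead log: append each compressed incoming request to a segment file named for its 5-minute window (old segments get deleted), and on startup replay whatever segments remain, since duplicates downstream are acceptable. Segment naming, paths, and replay_request() are assumptions.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define WAL_DIR     "/var/lib/packrat/wal"   /* assumed location */
#define WINDOW_SECS (5 * 60)

/* Provided elsewhere: push one logged request back through the pipeline. */
void replay_request(const char *buf, size_t len);

static FILE *open_current_segment(void)
{
    char path[PATH_MAX];
    time_t window = time(NULL) / WINDOW_SECS;      /* 5-minute bucket */
    snprintf(path, sizeof(path), "%s/%ld.wal", WAL_DIR, (long)window);
    return fopen(path, "ab");
}

/* Length-prefixed append of one compressed request before acking it. */
static int wal_append(const char *compressed, size_t len)
{
    FILE *f = open_current_segment();
    if (!f)
        return -1;
    int ok = fwrite(&len, sizeof(len), 1, f) == 1 &&
             fwrite(compressed, 1, len, f) == len;
    fclose(f);           /* sketch: real code would batch writes and fsync */
    return ok ? 0 : -1;
}

/* Startup: replay one segment; duplicates are tolerated downstream. */
static void wal_replay_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return;
    size_t len;
    static char buf[16 * 1024 * 1024];             /* assumed max request */
    while (fread(&len, sizeof(len), 1, f) == 1 &&
           len <= sizeof(buf) &&
           fread(buf, 1, len, f) == len) {
        replay_request(buf, len);
    }
    fclose(f);
}
```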
Lessons learned
If you're going to crash, do everything you can to limit its scope
Use every possible feature of your environment to your advantage
Have clear points of responsibility handoff
Find a way to replicate prod, even if it means testing in prod